CodeQL library for Python
codeql/python-all 2.0.0

Module TaintTracking

Python Taint Tracking Library

The taint tracking library is described in three parts.

  1. Specification of kinds, sources, sinks and flows.
  2. The high-level query API.
  3. The implementation.

Specification

There are four parts to the specification of a taint tracking query. These are:

  1. Kinds

    The Python taint tracking library supports arbitrary kinds of taint. This is useful when you want to track something that is related to “taint” but is not in itself dangerous. For example, we might want to track the flow of request objects. Request objects are not in themselves tainted, but they do contain tainted data: the length or timestamp of a request may not pose a risk, but the GET or POST string probably does. So we would want to track request objects separately from the request data in the GET or POST field.

    Kinds can also specify additional flow steps, but we recommend using the DataFlowExtension module, which is less likely to cause issues with unwanted recursion.

  2. Sources

    Sources of taint can be added by importing a predefined sub-type of TaintSource, or by defining new ones.

  3. Sinks (or vulnerabilities)

    Sinks can be added by importing a predefined sub-type of TaintSink, or by defining new ones.

  4. Flow extensions

    Additional flow can be added by importing predefined sub-types of DataFlowExtension::DataFlowNode or DataFlowExtension::DataFlowVariable or by defining new ones.
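To make the idea of distinct kinds concrete, here is a small Python sketch of the concept. It is a toy model, not the CodeQL API: the kind names ("request object", "untrusted string") and the helper kind_of_attribute are hypothetical and exist only for this illustration. The attribute names follow the GET/POST example given under Kinds.

```python
from typing import Optional

# Hypothetical kind names, used only for this illustration.
REQUEST_OBJECT = "request object"
UNTRUSTED_STRING = "untrusted string"

# Which attributes of a value of a given kind carry which kind of taint.
# Attributes that are not listed (for example "timestamp") are treated as harmless.
ATTRIBUTE_KINDS = {
    (REQUEST_OBJECT, "GET"): UNTRUSTED_STRING,
    (REQUEST_OBJECT, "POST"): UNTRUSTED_STRING,
}


def kind_of_attribute(kind: str, attribute: str) -> Optional[str]:
    """Return the kind carried by `value.attribute`, or None if it is untainted."""
    return ATTRIBUTE_KINDS.get((kind, attribute))
```

In this model a request object is tracked under its own kind, and only reading its GET or POST attribute yields a value of the dangerous kind; reading any other attribute yields nothing worth tracking.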

The high-level query API

The TaintedNode fully describes the taint flow graph. The full graph can be expressed as:

from TaintedNode n, TaintedNode s
where s = n.getASuccessor()
select n, s

The source -> sink relation can be expressed either using TaintedNode:

from TaintedNode src, TaintedNode sink
where src.isSource() and sink.isSink() and src.getASuccessor*() = sink
select src, sink

or, using the specification API:

from TaintSource src, TaintSink sink
where src.flowsToSink(sink)
select src, sink
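Both queries compute reachability over the taint flow graph: a sink is reported for a source if it can be reached in zero or more successor steps. The Python sketch below shows that computation on a toy graph. It does not use the CodeQL library; the node names and the functions reachable_from and source_sink_pairs are made up for this example.

```python
from collections import deque


def reachable_from(start, successors):
    """Nodes reachable from `start` in zero or more steps (reflexive transitive closure)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def source_sink_pairs(sources, sinks, successors):
    """All (source, sink) pairs where the sink is reachable from the source."""
    return sorted(
        (src, sink)
        for src in sources
        for sink in sinks
        if sink in reachable_from(src, successors)
    )


# Hypothetical flow graph: every node name here is invented for illustration.
successors = {
    "request.GET": ["name"],
    "name": ["greeting"],
    "greeting": ["response.write(arg)"],
    "config_value": ["log(arg)"],
}

pairs = source_sink_pairs(
    sources={"request.GET"},
    sinks={"response.write(arg)", "log(arg)"},
    successors=successors,
)
```

Here pairs contains only the flow from "request.GET" to "response.write(arg)", because "log(arg)" is fed by a node that no source reaches.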

The implementation

The data-flow graph used by the taint-tracking library is the one created by the points-to analysis. It consists of the base data-flow graph defined in semmle/python/essa/Essa.qll, enhanced with precise variable flows, call graph and type information. This graph is then enhanced with additional flows as specified above. Since the call graph and points-to information are context sensitive, the taint graph must also be context sensitive.

The taint graph is a directed graph in which each node is a (CFG node, context, taint) triple. It can be thought of more naturally as a number of distinct graphs, one for each input taint kind, each consisting of data-flow nodes, that is (CFG node, context) pairs, labelled with their taint.

The TrackedValue used in the implementation is not the taint kind specified by the user, but describes both the kind of taint and how that taint relates to any object referred to by a data-flow graph node or edge. Currently, only two types of taint are supported: simple taint, where the object is actually tainted; and attribute taint where a named attribute of the referred object is tainted.
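The difference between simple taint and attribute taint can be sketched in Python as follows. This is a simplified toy model under stated assumptions, not the library's implementation: the class names SimpleTaint and AttributeTaint and the two helper functions are hypothetical, and any kind-specific attribute rules are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class SimpleTaint:
    kind: str        # the object itself is tainted with this kind


@dataclass(frozen=True)
class AttributeTaint:
    attribute: str   # name of the tainted attribute
    kind: str        # kind of taint held by that attribute


Tracked = Union[SimpleTaint, AttributeTaint]


def after_attribute_store(tracked: Tracked, attribute: str) -> Optional[AttributeTaint]:
    """Model `obj.attribute = x`: a tainted x makes that attribute of obj tainted."""
    if isinstance(tracked, SimpleTaint):
        return AttributeTaint(attribute, tracked.kind)
    return None


def after_attribute_load(tracked: Tracked, attribute: str) -> Optional[SimpleTaint]:
    """Model `obj.attribute`: tainted only if exactly that attribute of obj is tainted."""
    if isinstance(tracked, AttributeTaint) and tracked.attribute == attribute:
        return SimpleTaint(tracked.kind)
    return None
```

In this sketch, storing a tainted value into an attribute turns simple taint into attribute taint on the enclosing object, and loading the same attribute turns it back into simple taint. Loading a different attribute yields nothing.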

Support for tainted members (both specific members of tuples and the like, and generic members for mutable collections) is likely to be added in the near future, and other forms are possible. The types of taint are hard-wired, with no user-visible extension method at the moment.

Import path

import semmle.python.dataflow.old.TaintTracking

Classes

CollectionKind

Taint kinds representing collections of other taint kinds. We use {kind} to represent a mapping of string to kind and [kind] to represent a flat collection of kind. The use of { and [ is chosen to reflect dict and list literals in Python. We choose a single-character prefix and suffix for simplicity and ease of preventing infinite recursion.
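The naming scheme can be illustrated with a few lines of Python. This is only a sketch of the string convention described above; the function names sequence_of, mapping_of and member_kind are hypothetical and are not part of the library.

```python
from typing import Optional


def sequence_of(kind: str) -> str:
    """Name for a flat collection of `kind`, such as a list or set."""
    return "[" + kind + "]"


def mapping_of(kind: str) -> str:
    """Name for a mapping from strings to `kind`, such as a dict."""
    return "{" + kind + "}"


def member_kind(collection_kind: str) -> Optional[str]:
    """Recover the member kind by stripping one prefix/suffix pair, or None."""
    if len(collection_kind) >= 2 and (
        (collection_kind[0], collection_kind[-1]) in {("[", "]"), ("{", "}")}
    ):
        return collection_kind[1:-1]
    return None
```

Because each wrapper is a single character on each side, nested collections compose by plain concatenation, and the nesting can be unwrapped one level at a time.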

DictKind

A taint kind representing a mapping of objects to kinds. Typically a dict, but can include other mappings.

Sanitizer

A type of sanitizer of untrusted data. Examples include sanitizers for HTTP responses, for DB access or for shell commands. Usually a sanitizer can only sanitize data for one particular use. For example, a sanitizer for DB commands would not be safe to use for HTTP responses.
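The following Python sketch shows why a sanitizer is tied to one use. The function escape_for_sql_literal is a hypothetical, deliberately naive helper written for this example (real code should prefer parameterized queries); html.escape is the standard-library function.

```python
import html


def escape_for_sql_literal(text: str) -> str:
    """Illustrative only: double single quotes, as inside a SQL string literal."""
    return text.replace("'", "''")


# A payload that is dangerous when written into an HTML page.
payload = "<script>alert(1)</script>"

# The SQL-oriented escaping leaves the markup untouched.
sql_escaped = escape_for_sql_literal(payload)

# The HTML-oriented escaping neutralizes the angle brackets.
html_escaped = html.escape(payload)
```

The SQL-oriented helper returns the payload unchanged, so data "sanitized" for the database would still be unsafe in an HTTP response, whereas the HTML escaping replaces the angle brackets with entities.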

SequenceKind

A taint kind representing a flat collection of kinds. Typically a sequence, but can include sets.

TaintKind

A ‘kind’ of taint. This may be almost anything, but it is typically something like a “user-defined string”. Examples include data from an HTTP request object, data from an SMS or other mobile data source, or, for a super-secure system, environment variables or the local file system.

TaintSink

A node that is vulnerable to one or more types of taint. These nodes provide the sinks when computing the taint flow graph. An example would be an argument to a write to an HTTP response object; such an argument would be vulnerable to unsanitized user input (XSS).

TaintSource

A source of taintedness. Users of the taint tracking library should override this class to provide their own sources.

TaintedDefinition

Warning: Advanced feature. Users are strongly recommended to use TaintSource instead. A source of taintedness on the ESSA data-flow graph. Users of the taint tracking library can override this class to provide their own sources on the ESSA graph.

TaintedPathSink
TaintedPathSource

Modules

DataFlow

Data flow module providing an interface compatible with the other language implementations.

DataFlowExtension

Extension for data-flow, to help express data-flow paths that are library or framework specific and cannot be inferred by the general data-flow machinery.

DictKind
SequenceKind

Aliases

FlowLabel

An alias of TaintKind, so the two types can be used interchangeably.

TaintedNode

A class representing the (node, context, path, kind) tuple. Used for context-sensitive path-aware taint-tracking.