Module IterableUnpacking

The unpacking assignment takes the general form

  sequence = iterable

where sequence is either a tuple or a list and it can contain wildcards. The iterable can be any iterable, which means that (CodeQL modeling of) content will need to change type if it should be transferred from the LHS to the RHS.

Note that (CodeQL modeling of) content does not have to change type on data-flow paths inside the LHS, as the different allowed syntaxes here are merely a convenience. Consequently, we model all LHS sequences as tuples, which have the more precise content model, making flow to the elements more precise. If an element is a starred variable, we will have to mutate the content type to be list content.

We may for instance have

   (a, b) = ["a", SOURCE]  # RHS has content `ListElementContent`

Due to the abstraction for list content, we do not know whether SOURCE ends up in a or in b, so we want to overapproximate and see it in both.

Using wildcards we may have

  (a, *b) = ("a", "b", SOURCE)  # RHS has content `TupleElementContent(2)`

Since the starred variables are always assigned (Python-)type list, *b will be ["b", SOURCE], and we will again overapproximate and assign it content corresponding to anything found in the RHS.

For a precise transfer

   (a, b) = ("a", SOURCE)  # RHS has content `TupleElementContent(1)`

we wish to keep the precision, so only b receives the tuple content at index 1.

Finally, sequence is actually a pattern and can have a more complicated structure, such as

  (a, [b, *c]) = ("a", ["b", SOURCE])  # RHS has content `TupleElementContent(1); ListElementContent`

where a should not receive content, but b and c should. c will be [SOURCE] so should have the content transferred, while b should read it.

To transfer content from RHS to the elements of the LHS in the expression sequence = iterable, we use two synthetic nodes:

TIterableSequence(sequence) which captures the content-modeling the entire sequence will have (essentially just a copy of the content-modeling the RHS has)
TIterableElement(sequence) which captures the content-modeling that will be assigned to an element. Note that an empty access path means that the value we are tracking flows directly to the element.

The TIterableSequence(sequence) is at this point superfluous but becomes useful when handling recursive structures in the LHS, where sequence is some internal sequence node. We can have a uniform treatment by always having these two synthetic nodes. So we transfer to (or, in the recursive case, read into) TIterableSequence(sequence), from which we take a read step to TIterableElement(sequence) and then a store step to sequence.

This allows the unknown content from the RHS to be read into TIterableElement(sequence) and tuple content to then be stored into sequence. If the content is already tuple content, this indirection creates crosstalk between indices. Therefore, tuple content is never read into TIterableElement(sequence); it is instead transferred directly from TIterableSequence(sequence) to sequence via a flow step. Such a flow step will also transfer other content, but only tuple content is further read from sequence into its elements.

The strategy is then via several read-, store-, and flow steps:

a) [Flow] Content is transferred from iterable to TIterableSequence(sequence) via a flow step. From here, everything happens on the LHS.

b) [Read] If the unpacking happens inside a for as in
```
   for sequence in iterable
```
then content is read from iterable to TIterableSequence(sequence).
[Flow] Content is transferred from TIterableSequence(sequence) to sequence via a flow step. (Here only tuple content is relevant.)
[Read] Content is read from TIterableSequence(sequence) into TIterableElement(sequence). As sequence is modeled as a tuple, we will not read tuple content as that would allow crosstalk.
[Store] Content is stored from TIterableElement(sequence) to sequence. Content type is TupleElementContent with indices taken from the syntax. For instance, if sequence is (a, *b, c), content is written to index 0, 1, and 2. This is adequate as the route through TIterableElement(sequence) does not transfer precise content.
[Read] Content is read from sequence to its elements. a) If the element is a plain variable, the target is the corresponding control flow node.

b) If the element is itself a sequence, with control-flow node seq, the target is TIterableSequence(seq).

c) If the element is a starred variable, with control-flow node v, the target is TIterableElement(v).
[Store] Content is stored from TIterableElement(v) to the control flow node for variable v, with content type ListElementContent.
[Flow, Read, Store] Steps 2 through 7 are repeated for all recursive elements which are sequences.

We illustrate the above steps on the assignment

(a, b) = ["a", SOURCE]

Looking at the content propagation to a: ["a", SOURCE]: [ListElementContent]

–Step 1a–>

TIterableSequence((a, b)): [ListElementContent]

–Step 3–>

TIterableElement((a, b)): []

–Step 4–>

(a, b): [TupleElementContent(0)]

–Step 5a–>

a: []

Meaning there is data-flow from the RHS to a (an over approximation). The same logic would be applied to show there is data-flow to b. Note that Step 3 and Step 4 would not have been needed if the RHS had been a tuple (since that would have been able to use Step 2 instead).

Another, more complicated example:

  (a, [b, *c]) = ["a", [SOURCE]]

where the path to c is

["a", [SOURCE]]: [ListElementContent; ListElementContent]

–Step 1a–>

TIterableSequence((a, [b, *c])): [ListElementContent; ListElementContent]

–Step 3–>

TIterableElement((a, [b, *c])): [ListElementContent]

–Step 4–>

(a, [b, *c]): [TupleElementContent(1); ListElementContent]

–Step 5b–>

TIterableSequence([b, *c]): [ListElementContent]

–Step 3–>

TIterableElement([b, *c]): []

–Step 4–>

[b, *c]: [TupleElementContent(1)]

–Step 5c–>

TIterableElement(c): []

–Step 6–>

c: [ListElementContent]

Import path

import semmle.python.dataflow.new.internal.IterableUnpacking

Predicates

iterableUnpackingAssignmentFlowStep	Step 1a Data flows from `iterable` to `TIterableSequence(sequence)`
iterableUnpackingConvertingReadStep	Step 3 Data flows from `TIterableSequence(sequence)` into `TIterableElement(sequence)`. As `sequence` is modeled as a tuple, we will not read tuple content as that would allow crosstalk.
iterableUnpackingConvertingStoreStep	Step 4 Data flows from `TIterableElement(sequence)` to `sequence`. Content type is `TupleElementContent` with indices taken from the syntax. For instance, if `sequence` is `(a, *b, c)`, content is written to index 0, 1, and 2.
iterableUnpackingElementReadStep	Step 5 For a sequence node inside an iterable unpacking, data flows from the sequence to its elements. There are three cases for what `toNode` should be: a) If the element is a plain variable, `toNode` is the corresponding control flow node.
iterableUnpackingFlowStep	All flow steps associated with unpacking assignment.
iterableUnpackingForReadStep	Step 1b Data is read from `iterable` to `TIterableSequence(sequence)`
iterableUnpackingReadStep	All read steps associated with unpacking assignment.
iterableUnpackingStarredElementStoreStep	Step 6 Data flows from `TIterableElement(v)` to the control flow node for variable `v`, with content type `ListElementContent`.
iterableUnpackingStoreStep	All store steps associated with unpacking assignment.
iterableUnpackingTupleFlowStep	Step 2 Data flows from `TIterableSequence(sequence)` to `sequence`

Classes

AssignmentTarget	The LHS of an assignment, it also records the assigned value.
ForTarget	The target of a `for`, e.g. `x` in `for x in list` or in `[42 for x in list]`. This class also records the source, which in both above cases is `list`. This class abstracts away the differing representations of comprehensions and for statements.
UnpackingAssignmentDirectTarget	A direct (or top-level) target of an unpacking assignment.
UnpackingAssignmentSequenceTarget	A (possibly recursive) target of an unpacking assignment which is also a sequence.
UnpackingAssignmentTarget	A (possibly recursive) target of an unpacking assignment.