CodeQL library for Python¶
When you need to analyze a Python program, you can make use of the large collection of classes in the CodeQL library for Python.
About the CodeQL library for Python¶
The CodeQL library for each programming language uses classes with abstractions and predicates to present data in an object-oriented form.
Each CodeQL library is implemented as a set of QL modules, that is, files with the extension .qll. The module python.qll imports all the core Python library modules, so you can include the complete library by beginning your query with:
import python
The CodeQL library for Python incorporates a large number of classes. Each class corresponds either to one kind of entity in Python source code or to an entity that can be derived from the source code using static analysis. These classes can be divided into four categories:
- Syntactic - classes that represent entities in the Python source code.
- Control flow - classes that represent entities from the control flow graphs.
- Data flow - classes that represent entities from the data flow graphs.
- API graphs - classes that represent entities from the API graphs.
The first two categories are described below. For a description of data flow and associated classes, see “Analyzing data flow in Python.” For a description of API graphs and their use, see “Using API graphs in Python.”
Syntactic classes¶
This part of the library represents the Python source code. The Module, Class, and Function classes correspond to Python modules, classes, and functions respectively; collectively these are known as Scope classes. Each Scope contains a list of statements, each of which is represented by a subclass of the class Stmt. Statements themselves can contain other statements or expressions, which are represented by subclasses of Expr. Finally, there are a few additional classes for the parts of more complex expressions such as list comprehensions. Collectively these classes are subclasses of AstNode and form an abstract syntax tree (AST). The root of each AST is a Module. Symbolic information is attached to the AST in the form of variables (represented by the class Variable). For more information, see Abstract syntax tree and Symbolic information on Wikipedia.
Scope¶
A Python program is a group of modules. Technically a module is just a list of statements, but we often think of it as composed of classes and functions. These top-level entities (the module, the class, and the function) are represented by the three CodeQL classes Module, Class, and Function, which are all subclasses of Scope.
Scope
- Module
- Class
- Function
Every scope is essentially a list of statements, although Scope classes have additional attributes such as names. For example, the following query finds all functions whose scope (the scope in which they are declared) is also a function:
import python
from Function f
where f.getScope() instanceof Function
select f
Many codebases use nested functions.
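The same getScope() pattern works for the other Scope classes. As a minimal sketch, the query below finds functions declared directly inside a class (that is, methods) and reports both names. It assumes the getName() predicates on Function and Class, and the message text is purely illustrative:

```ql
import python

// Bind the enclosing scope to a Class variable so that we can report its name.
from Function f, Class c
where f.getScope() = c
select f, "Function '" + f.getName() + "' is declared in class '" + c.getName() + "'."
```

Binding the scope to a variable, rather than testing it with instanceof, makes the enclosing entity available in the select clause.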
Statement¶
A statement is represented by the Stmt class, which has about 20 subclasses representing the various kinds of statements, such as the Pass statement, the Return statement, or the For statement. Statements are usually made up of parts. The most common of these is the expression, represented by the Expr class. For example, take the following Python for statement:
for var in seq:
    pass
else:
    return 0
The For class representing the for statement has a number of member predicates to access its parts:
- getTarget() returns the Expr representing the variable var.
- getIter() returns the Expr representing the variable seq.
- getBody() returns the statement list body.
- getStmt(0) returns the pass Stmt.
- getOrelse() returns the StmtList containing the return statement.
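These member predicates can be combined in a query. The following sketch looks for for statements shaped like the example above: the first statement of the body is pass and an else branch is present. It assumes that the else-branch predicate is spelled getOrelse() in your version of the library, which is worth checking before relying on it:

```ql
import python

// Match loops whose body starts with `pass` and which have an `else` branch.
from For f
where
  f.getStmt(0) instanceof Pass and
  exists(f.getOrelse())
select f, f.getTarget(), f.getIter()
```

The extra columns in the select clause show the loop target and the iterated expression next to each result.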
Expression¶
Most statements are made up of expressions. The Expr class is the superclass of all expression classes, of which there are about 30, including calls, comprehensions, tuples, lists, and arithmetic operations. For example, the Python expression a+2 is represented by the BinaryExpr class:
- getLeft() returns the Expr representing a.
- getRight() returns the Expr representing 2.
As an example, to find expressions of the form a+2, where the left operand is a simple name and the right operand is a numeric constant, we can use the following query:
Finding expressions of the form “a+2”
import python
from BinaryExpr bin
where bin.getLeft() instanceof Name and bin.getRight() instanceof Num
select bin
Many codebases include examples of this pattern.
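Note that the query above matches any binary operator, so it also finds expressions such as a-2 or a*2. A hedged refinement, assuming the getOp() predicate on BinaryExpr, the Add operator class, and getId() on Name, restricts the results to addition and reports the variable name:

```ql
import python

// Keep only additions whose left operand is a name and whose right operand is a number.
from BinaryExpr bin, Name left
where
  left = bin.getLeft() and
  bin.getOp() instanceof Add and
  bin.getRight() instanceof Num
select bin, "Adds a numeric literal to the variable '" + left.getId() + "'."
```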
Variable¶
Variables are represented by the Variable class in the CodeQL library. There are two subclasses: LocalVariable for function-level and class-level variables, and GlobalVariable for module-level variables.
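As a small illustration, the following sketch lists the local variables of each function. It assumes the getScope() and getId() predicates on Variable; the choice to restrict the scope to Function (leaving out class-level variables) is only for the example:

```ql
import python

// Pair each function with the names of the variables that are local to it.
from LocalVariable v, Function f
where v.getScope() = f
select f, v.getId()
```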
Other source code elements¶
Although the meaning of the program is encoded by the syntactic elements, Scope
, Stmt
and Expr
there are some parts of the source code not covered by the abstract syntax tree. The most useful of these is the Comment class which describes comments in the source code.
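Comments can be queried like any other element. The sketch below finds comments that mention TODO, assuming the getText() predicate on Comment; the search pattern is an arbitrary choice for illustration:

```ql
import python

// `matches` uses SQL-style wildcards: % stands for any sequence of characters.
from Comment c
where c.getText().matches("%TODO%")
select c, c.getText()
```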
Examples¶
Each syntactic element in Python source is recorded in the CodeQL database. These can be queried via the corresponding class. Let us start with a couple of simple examples.
1. Finding all finally blocks¶
For our first example, we can find all finally blocks by using the Try class:
Find all finally blocks
import python
from Try t
select t.getFinalbody()
Many codebases include examples of this pattern.
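Once the finally blocks are found, their contents can be inspected as well. As a hedged sketch, the query below reports return statements that appear directly inside a finally block, a pattern that can silently discard an in-flight exception. It assumes the getAFinalstmt() predicate on Try and only looks at top-level statements of the block, not nested ones:

```ql
import python

// A `return` that is an immediate child of the `finally` block of `t`.
from Try t, Return r
where r = t.getAFinalstmt()
select r, "This return statement is directly inside a finally block."
```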
2. Finding except blocks that do nothing¶
For our second example, we can use a simplified version of a query from the standard query set. We look for all except blocks that do nothing.
A block that does nothing is one that contains no statements except pass statements. We can encode this as:
not exists(Stmt s | s = ex.getAStmt() | not s instanceof Pass)
where ex is an ExceptStmt and Pass is the class representing pass statements. Instead of using the double negative, “no statements that are not pass statements”, this can also be expressed positively, “all statements must be pass statements.” The positive form is expressed using the forall quantifier:
forall(Stmt s | s = ex.getAStmt() | s instanceof Pass)
Both forms are equivalent. Using the positive expression, the whole query looks like this:
Find pass-only except blocks
import python
from ExceptStmt ex
where forall(Stmt s | s = ex.getAStmt() | s instanceof Pass)
select ex
Many codebases include pass-only except blocks.
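The forall condition can be combined with other conditions to narrow the results. The following sketch keeps only bare except blocks, that is, handlers that name no exception type, and therefore swallow every exception. It assumes that getType() on ExceptStmt has no result for a bare except:

```ql
import python

// Bare `except:` handlers whose body consists only of `pass` statements.
from ExceptStmt ex
where
  not exists(ex.getType()) and
  forall(Stmt s | s = ex.getAStmt() | s instanceof Pass)
select ex, "Bare except block that silently ignores every exception."
```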
Summary¶
The most commonly used standard classes in the syntactic part of the library are organized as follows:
Module, Class, Function, Stmt, and Expr are all subclasses of AstNode.
Abstract syntax tree¶
AstNode
- Module – A Python module
- Class – The body of a class definition
- Function – The body of a function definition
- Stmt – A statement
  - Assert – An assert statement
  - Assign – An assignment
    - AssignStmt – An assignment statement, x = y
    - ClassDef – A class definition statement
    - FunctionDef – A function definition statement
  - AugAssign – An augmented assignment, x += y
  - Break – A break statement
  - Continue – A continue statement
  - Delete – A del statement
  - ExceptStmt – The except part of a try statement
  - Exec – An exec statement
  - For – A for statement
  - If – An if statement
  - Pass – A pass statement
  - Print – A print statement (Python 2 only)
  - Raise – A raise statement
  - Return – A return statement
  - Try – A try statement
  - While – A while statement
  - With – A with statement
- Expr – An expression
  - Attribute – An attribute, obj.attr
  - Call – A function call, f(arg)
  - IfExp – A conditional expression, x if cond else y
  - Lambda – A lambda expression
  - Yield – A yield expression
  - Bytes – A bytes literal, b"x" or (in Python 2) "x"
  - Unicode – A unicode literal, u"x" or (in Python 3) "x"
  - Num – A numeric literal, 3 or 4.2
    - IntegerLiteral
    - FloatLiteral
    - ImaginaryLiteral
  - Dict – A dictionary literal, {'a': 2}
  - Set – A set literal, {'a', 'b'}
  - List – A list literal, ['a', 'b']
  - Tuple – A tuple literal, ('a', 'b')
  - DictComp – A dictionary comprehension, {k: v for ...}
  - SetComp – A set comprehension, {x for ...}
  - ListComp – A list comprehension, [x for ...]
  - GenExpr – A generator expression, (x for ...)
  - Subscript – A subscript operation, seq[index]
  - Name – A reference to a variable, var
  - UnaryExpr – A unary operation, -x
  - BinaryExpr – A binary operation, x+y
  - Compare – A comparison operation, 0 < x < 10
  - BoolExpr – Short-circuit logical operations, x and y, x or y
Variables¶
Variable – A variable
- LocalVariable – A variable local to a function or a class
- GlobalVariable – A module-level variable
Other¶
Comment – A comment
Control flow classes¶
This part of the library represents the control flow graph of each Scope (classes, functions, and modules). Each Scope contains a graph of ControlFlowNode elements. Each scope has a single entry point and one or more exit points. To speed up control and data flow analysis, control flow nodes are grouped into basic blocks. For more information, see Basic block on Wikipedia.
Example¶
If we want to find the longest sequence of code without any branches, we need to consider control flow. A BasicBlock is, by definition, a sequence of code without any branches, so we just need to find the longest BasicBlock.
First of all we introduce a simple predicate bb_length() which relates BasicBlocks to their length.
int bb_length(BasicBlock b) {
  result = max(int i | exists(b.getNode(i)) | i) + 1
}
Each ControlFlowNode within a BasicBlock is numbered consecutively, starting from zero. The length of a BasicBlock is therefore equal to one more than the largest index within that BasicBlock.
Using this predicate we can select the longest BasicBlock by selecting the BasicBlock whose length is equal to the maximum length of any BasicBlock:
Find the longest sequence of code without branches
import python

int bb_length(BasicBlock b) {
  result = max(int i | exists(b.getNode(i)) | i) + 1
}

from BasicBlock b
where bb_length(b) = max(bb_length(_))
select b
Note
The special underscore variable _ means any value, so bb_length(_) is the length of any block.
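The same predicate can be reused for other questions about block size. As a sketch, the query below lists every basic block longer than a threshold, longest first. The threshold of 20 nodes is an arbitrary value chosen for illustration, not a recommendation from the library:

```ql
import python

// Length of a basic block: one more than the largest node index it contains.
int bb_length(BasicBlock b) {
  result = max(int i | exists(b.getNode(i)) | i) + 1
}

// Report the long blocks, longest first. The cut-off of 20 is hypothetical.
from BasicBlock b, int len
where len = bb_length(b) and len > 20
select b, len order by len desc
```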
Summary¶
The classes in the control-flow part of the library are:
- ControlFlowNode – A control-flow node. There is a one-to-many relation between AST nodes and control-flow nodes.
- BasicBlock – A non-branching list of control-flow nodes.
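The one-to-many relation between AST nodes and control-flow nodes can be observed directly. The following sketch, which assumes the getNode() predicate on ControlFlowNode, lists AST nodes that are represented by more than one control-flow node, together with the number of nodes:

```ql
import python

// Count the control-flow nodes that map back to each AST node.
from AstNode n, int cnt
where
  cnt = count(ControlFlowNode c | c.getNode() = n) and
  cnt > 1
select n, cnt
```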