CodeQL library for Java
codeql/java-all 0.5.5 ( changelog , source )
Search

Module ExponentialBackTracking

This library implements the analysis described in the following two papers:

James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attacks. NSS 2013. (http://www.cs.bham.ac.uk/~hxt/research/reg-exp-sec.pdf) Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Exponential Runtime via Substructural Logics. 2014. (https://www.cs.bham.ac.uk/~hxt/research/redos_full.pdf)

The basic idea is to search for overlapping cycles in the NFA, that is, states q such that there are two distinct paths from q to itself that consume the same word w.

For any such state q, an attack string can be constructed as follows: concatenate a prefix v that takes the NFA to q with n copies of the word w that leads back to q along two different paths, followed by a suffix x that is not accepted in state q. A backtracking implementation will need to explore at least 2^n different ways of going from q back to itself while trying to match the n copies of w before finally giving up.

Now in order to identify overlapping cycles, all we have to do is find pumpable forks, that is, states q that can transition to two different states r1 and r2 on the same input symbol c, such that there are paths from both r1 and r2 to q that consume the same word. The latter condition is equivalent to saying that (q, q) is reachable from (r1, r2) in the product NFA.

This is what the library does. It makes a simple attempt to construct a prefix v leading into q, but only to improve the alert message. And the library tries to prove the existence of a suffix that ensures rejection. This check might fail, which can cause false positives.

Finally, sometimes it depends on the translation whether the NFA generated for a regular expression has a pumpable fork or not. We implement one particular translation, which may result in false positives or negatives relative to some particular JavaScript engine.

More precisely, the library constructs an NFA from a regular expression r as follows:

  • Every sub-term t gives rise to an NFA state Match(t,i), representing the state of the automaton before attempting to match the ith character in t.
  • There is one accepting state Accept(r).
  • There is a special AcceptAnySuffix(r) state, which accepts any suffix string by using an epsilon transition to Accept(r) and an any transition to itself.
  • Transitions between states may be labelled with epsilon, or an abstract input symbol.
  • Each abstract input symbol represents a set of concrete input characters: either a single character, a set of characters represented by a character class, or the set of all characters.
  • The product automaton is constructed lazily, starting with pair states (q, q) where q is a fork, and proceeding along an over-approximate step relation.
  • The over-approximate step relation allows transitions along pairs of abstract input symbols where the symbols have overlap in the characters they accept.
  • Once a trace of pairs of abstract input symbols that leads from a fork back to itself has been identified, we attempt to construct a concrete string corresponding to it, which may fail.
  • Lastly we ensure that any state reached by repeating n copies of w has a suffix x (possible empty) that is most likely not accepted.

Import path

import semmle.code.java.security.regexp.ExponentialBackTracking

Imports

Dep

A parameterized module implementing the analysis described in the above papers.