How Interprocedural Data-flow Works in Joern

In case you were wondering, Joern supports data-flow analysis across procedure (and file) boundaries. This is based on a preprocessing step of intraprocedural data-flow slicing done in parallel, followed by on-the-fly symbol tracking at query time. Both are mainly accessed via the data-flow query steps.

During these data-flow query steps, the main mechanisms at play are the call resolver and pre-defined semantics. These, along with the source and sink traversals you supply, determine how successful and precise your interprocedural data-flow analysis will be.

Call Graphs in the CPG

The call resolver mechanism influences how the call steps work. For Joern-generated CPGs, the design decision we’ve opted for is a pre-generated call graph. This is useful for analyses that want to ingest the CPG in a more “complete” form. It also leads to faster interprocedural data-flow analysis, as the calls have already been resolved (to some degree). The downside is that CPG generation takes longer and, for dynamic languages, may involve an expensive type-recovery step.

Call Resolvers

This default behaviour is provided by the implicit NoResolve class, which simply follows the pre-defined CALL edges if present. These schema edges take the form CALL -(CALL)-> METHOD.

If you would like to build something more on-the-fly, like a points-to analysis to resolve types and calls, you would simply need to implement ICallResolver. This may be beneficial in dynamic languages, where pre-computing types may not always be feasible.
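To make the difference between the two strategies concrete, here is a small self-contained sketch in plain Scala. It does not use Joern's classes: `MethodDef`, `CallSite`, `Resolver`, `PrecomputedResolver` and `OnTheFlyResolver` are hypothetical names chosen for illustration, and the real ICallResolver interface has its own method signatures.

```scala
// Toy model only: these types are hypothetical and are not Joern's API.
final case class MethodDef(fullName: String)
final case class CallSite(code: String, receiverType: Option[String], name: String)

trait Resolver {
  def calledMethods(call: CallSite): Set[MethodDef]
}

// Mirrors the idea behind the default behaviour: only follow edges that
// were already computed when the graph was built.
final class PrecomputedResolver(edges: Map[CallSite, Set[MethodDef]]) extends Resolver {
  def calledMethods(call: CallSite): Set[MethodDef] =
    edges.getOrElse(call, Set.empty)
}

// Resolves lazily at query time, here by combining the receiver type
// (if known) with the call name.
final class OnTheFlyResolver(knownMethods: Set[MethodDef]) extends Resolver {
  def calledMethods(call: CallSite): Set[MethodDef] =
    call.receiverType match {
      case Some(tpe) => knownMethods.filter(_.fullName == s"$tpe.${call.name}")
      case None      => knownMethods.filter(_.fullName.endsWith(s".${call.name}"))
    }
}

object ResolverDemo {
  val getTheHeader: MethodDef = MethodDef("crossfiletest.Request.getTheHeader")
  val call: CallSite =
    CallSite("Request.getTheHeader(request)", Some("crossfiletest.Request"), "getTheHeader")

  val precomputed = new PrecomputedResolver(Map(call -> Set(getTheHeader)))
  val onTheFly    = new OnTheFlyResolver(Set(getTheHeader))
}
```

Both resolvers answer the same question for a call-site. The trade-off is where the cost lands: the pre-computed variant pays it at graph-construction time, the on-the-fly variant pays it at every query.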

External Calls

One important point to note is that when a call-site cannot be resolved to a method, a bare-bones METHOD stub node is associated with it. This failure to resolve the method may be due to limitations of the call graph analysis, or simply because the method belongs to an external library that is not part of the code analysed by Joern.

If this is the case, the method node will have the property isExternal = true, and one will not be able to obtain reliable information such as parameter names or modifiers. However, this is a good way to identify potentially sensitive sources, as many of these present themselves via external dependencies.

How Joern models data-flow to and from these calls is discussed under the next heading.

Semantics

A more extensive write-up of the semantics can be found in the documentation. What follows is a basic description of how the data-flow engine interacts with them.

Handling External Methods

When the data-flow engine tracks data-flow to a call-site, it needs to understand how data going in and out of the invocation will flow. For internally defined methods, this can simply be tracked within the CPG. For externally defined (or unresolvable) methods, this will be (soundly) overapproximated to propagate the taint to and from all parameters and return values.

The overapproximation will lead to imprecision, so Joern allows for simple method summaries to be defined. If a method cannot be internally resolved, Joern looks up the method against the list of default and user-supplied custom semantic definitions. These semantics prevent the data-flow engine from simply overapproximating, which boosts both precision and performance.
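The lookup-then-fallback behaviour described above can be sketched in plain Scala. This is a toy model rather than Joern's implementation: `SemanticsSketch`, `Summary`, `propagate`, the method name `example.Lib.copyFirst`, and the position encoding (-1 for the return value, 0 for the receiver, 1 and up for arguments) are all assumptions made for illustration.

```scala
// Toy model only: names and encodings are hypothetical, not Joern's API.
object SemanticsSketch {
  // Assumed encoding: -1 is the return value, 0 the receiver, 1.. the arguments.
  val Return: Int = -1

  // A summary maps a tainted input position to the positions it taints.
  type Summary = Map[Int, Set[Int]]

  // Hypothetical summaries, keyed by method full name.
  val summaries: Map[String, Summary] = Map(
    // Only the first argument flows to itself and to the return value.
    "example.Lib.copyFirst" -> Map(1 -> Set(1, Return))
  )

  // Positions tainted after the call, given a single tainted input position.
  def propagate(methodFullName: String, argCount: Int, taintedIn: Int): Set[Int] =
    summaries.get(methodFullName) match {
      // A summary exists: propagate only along the listed mappings.
      case Some(summary) => summary.getOrElse(taintedIn, Set.empty[Int])
      // No summary: overapproximate by tainting the receiver,
      // every argument, and the return value.
      case None => Set(Return) ++ (0 to argCount)
    }
}
```

With a summary in place, a tainted second argument of `example.Lib.copyFirst` taints nothing further, whereas the same call without a summary would taint every position. Fewer spurious flows also means fewer paths for the engine to explore.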

Examples

The following provides some query and testing tips around interprocedural data-flow tracking.

Performant Source-Sink Matching

If the language you are analysing is well-supported by Joern, then it is safe to define your sources and sinks in the following way, e.g., for attacker-controlled headers in HttpServletRequest:

cpg.typeDecl
  .fullNameExact("javax.servlet.http.HttpServletRequest")
  .method
  .nameExact("getHeader")
  .callIn
  .l // Gives us a `List[Call]` in the end

The graph is indexed by node labels, and the number of nodes in the CALL set generally far outnumbers those in the METHOD or TYPE_DECL sets. So starting from the smaller node set and jumping across edges will generally give the query a performance boost.

However, if the method is stubbed because it is external, then there will be no TYPE_DECL node connected to it, as stubs are not part of an AST. We can then do:

cpg.method
  .fullName("javax.servlet.http.HttpServletRequest.getHeader.*") 
  .callIn
  .l

When calling the fullName step without the Exact suffix, we enable the slower (but more flexible) regex-enabled string search. Here we use a regex to ignore the signature part of the full name. Signatures are part of method full names for languages that support method overloading.
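The effect of the trailing `.*` can be checked outside of Joern with plain Scala string matching. The sketch below assumes a Java method full name of the shape `<type>.<name>:<returnType>(<paramTypes>)`; the concrete string and the `FullNameMatching` object are illustrative, and `String.matches` stands in for the regex step rather than reproducing Joern's internals.

```scala
import java.util.regex.Pattern

object FullNameMatching {
  // Assumed shape of a Java method full name, signature included.
  val fullName: String =
    "javax.servlet.http.HttpServletRequest.getHeader:java.lang.String(java.lang.String)"

  val prefix: String = "javax.servlet.http.HttpServletRequest.getHeader"

  // Exact comparison needs the complete string, so the bare prefix does not match.
  val exactHit: Boolean = fullName == prefix

  // String.matches must cover the whole string; the trailing `.*` absorbs the signature.
  val regexHit: Boolean = fullName.matches(prefix + ".*")

  // Unescaped dots match any character, so the pattern is looser than it looks.
  val lookalike: String = "javaxXservlet.http.HttpServletRequest.getHeader"
  val looseHit: Boolean = lookalike.matches(prefix + ".*")

  // Quoting the literal part keeps the dots literal and rejects the lookalike.
  val strictPattern: String = Pattern.quote(prefix) + ".*"
  val strictHit: Boolean      = fullName.matches(strictPattern)
  val strictLooseHit: Boolean = lookalike.matches(strictPattern)
}
```

In practice the loose dots rarely cause false matches on fully qualified names, but quoting the literal part is a cheap safeguard when a pattern is built from user input.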

However, if the call graph is unable to resolve methods, we may need to eventually match against the set of all call nodes.

cpg.call
  .methodFullName("javax.servlet.http.HttpServletRequest.getHeader.*") 
  .l

This is generally a bit slower, but it is robust for languages that are prone to unresolved call-sites.

Testing with ScalaTest

One can import Joern’s test fixtures into their Joern-based static analysis tool by adding the following lines in their build.sbt:

libraryDependencies ++= Seq(
  // Core test dependencies
  "io.joern" %% "semanticcpg" % Versions.joern % Test classifier "tests",
  "io.joern" %% "x2cpg" % Versions.joern % Test classifier "tests",
  "io.joern" %% "dataflowengineoss" % Versions.joern % Test classifier "tests",
  // Whatever language frontend...
  "io.joern" %% "javasrc2cpg" % Versions.joern % Test classifier "tests", 
)

We can then create our own Scala tests to validate these data-flows.

class CrossFileFlowsTest extends JavaSrcCode2CpgFixture(withOssDataflow = true) {
  "A flow between files" should {

    val cpg = code(
      """
        |package crossfiletest;
        |import javax.servlet.http.HttpServletRequest;
        |public class Request {
        |    public static String getTheHeader(HttpServletRequest request) {
        |        return request.getHeader("some-header-name"); // `getHeader` is our source
        |    }
        |}
        |""".stripMargin, "crossfiletest/Request.java").moreCode(
      """
        |
        |package crossfiletest;
        |import javax.servlet.http.HttpServletRequest;
        |public class Main {
        |    public void processRequest(HttpServletRequest request) {
        |        String query = Request.getTheHeader(request); // `query` is our sink
        |    }
        |}
        |""".stripMargin, "crossfiletest/Main.java")

    "be found by the data-flow engine" in {
      val source = cpg.method
        .fullName(".*HttpServletRequest.getHeader.*")
        .where(_.isExternal)
        .callIn
        .l
      val sink = cpg.assignment
        .where(_.target.isIdentifier.nameExact("query"))
        .l

      sink.reachableBy(source).size should be > 0
    }
  }
}

We can visualize the flows by changing the query to pretty-print to console using sink.reachableByFlows(source).p.foreach(println).

__________________________________________________________________________________________________________
| nodeType     | tracked                        | lineNumber| method         | file                       |
|=========================================================================================================|
| Call         | getHeader("some-header-name")  | 6         | getTheHeader   | crossfiletest/Request.java |
| Return       | // replace "some-header-nam... | 6         | getTheHeader   | crossfiletest/Request.java |
| MethodReturn | RET                            | 5         | getTheHeader   | crossfiletest/Request.java |
| Call         | getTheHeader(request)          | 7         | processRequest | crossfiletest/Main.java    |
| Identifier   | String query = getTheHeader... | 7         | processRequest | crossfiletest/Main.java    |
| Call         | String query = getTheHeader... | 7         | processRequest | crossfiletest/Main.java    |

Conclusion

In this post, we discussed, at a high level, some of the inner workings of the data-flow engine and what its success depends on. If you have questions or would like to see more blogs like this, let us know in the Discord channel!

Happy hunting!