NJU Static Program Analysis 02: Intermediate Representation

NJU Static Program Analysis 02:
Intermediate Representation

Abstraction

  1. The relation between compilers and static analysis
  2. Understand 3AC and its common forms (in IR jimple)
  3. How to build basic blocks on top of IR
  4. How to construct control flow graphs on top of BBs?

Notes

Intermediate Representation (IR) is a kind of code generated by the compiler during Translation and is usually considered as the basis of static analysis. It is somehow like a not-that-high-level programming language without any grammar sugar, or an assembly language with more details like typing system. The features of IR includes:

  • low-level and closed to machine code
  • usually independent to a specific programming language
  • compact and uniform
  • contains control flow information

One of the most popular IR is Three-Address Code (3AC), each of its sentance contains at most three addresses (Address can be Name, Constant, or Compiler-generated temporary variables).

For example the code:

do {
    i = i + 1;
} while (a[i] < v);

can be translated to

1: i = i + 1
2: t1 = a[ i ]
3: if t1 < v goto 1

We have these common forms of 3AC like

x = y bop z, x = uop y, x = y, goto L, if x goto L, if x rop y goto L,

where:

x, y, z refer to addresses,
bop is a binary arithmetic operation,
uop is a unary operation,
rop is a relational operation,
L is a label to represent a program location,

Another popular IR is Statuc Single Assignment (SSA). It distinct variables by not only its definition or scope, but also the sentence it belongs to. For instance,

// 3AC
1: p = a + b
2: q = p - c
3: p = q * d
4: p = e - p
5: q = p + q

just equals to

//SSA
1: p_1 = a + b
2: q_1 = p_1 - c
3: p_2 = q_1 * d
4: p_3 = e - p_2
5: q_2 = p_3 + q_1

and if there is a variable used when control flow merges, we introduce a (phi) function.

image-20210702190353773

Obviously, compared to 3AC, SSA provides more explicit define-and-use pairs and more directly incorporated flow information, which helps some optimization, data-demanding or other tasks. At meanwhile, SSA can intorduce too many variables and (phi) functions, causing further efficiency porblems.

Based on the intermediate representation we have, generating a Control Flow Graph (CFG) of the given prgram becomes possible. CFG is a graph stucture consisting vertexes abstracted from a complete block of codes and edges abstracted from the goto instructions (the default edge constructed as sequential order is ignored). For example we have:

image-20210702192032983

(The black edges here are ignored in an actual CFG)

Each vertex is a set individual 3-address instruction called Basic Block (BB). A BB should be the maximal sequences of consecutice 3AC instructions with properties that it can only be entered at the beginning and exited at the end. We can easily come up with an algorithm divide a whole passage of 3AC into BBs: claim each line a goto sentence starts from and end to as divisions, we naturally got the required BBs.

And we add an edge from block A to B in a CFG if and only if when there is a jump from the end of A to the beginning of B except that B sequentially follows A and the jump from A to B is an unconditional one.

原文地址:https://www.cnblogs.com/Shimarin/p/14964800.html