Introduction

We will build a transpiler, or source to source compiler, for a subset of Java, the MiniJava language. The input representation for our compiler is the MiniJava language and the output representation is the C language.
A sample program which computes the factorial of a given number is given below.

class Factorial {
    public static void main(String[] a) {
         System.out.println(new Fac().computeFac(10));
    }
}

class Fac {
    public int computeFac(int num) {
        int numAux;
        if (num < 1) numAux = 1;
        else numAux = num * (this.computeFac(num-1));
        return numAux;
    }
}

As Minijava is a subset of Java, we will be able to compile our MiniJava programs using the Java compiler javac. This will be very usefull to test the validity of our translations in C. Indeed, we will be able to compare the output of the program compiled with the Java compiler with the output of the program obtained by our transpiler.

Overview of the Transpiler

We will use the following MiniJava program as a running example for this section.

class Print42 {
    public static void main(String[] a) {
            System.out.println(35 + 2 * 3 + 1);
    }
}

The following figure shows the main steps to transform a MiniJava program to a transpiled C source file.

graph LR; A[MiniJava source file] -->|characters| B(fa:fa-tools
Lexical
analysis) B -->|tokens| C(fa:fa-tools
Syntactic
analysis) C -->|abstract syntax tree| D(fa:fa-tools
Typechecker) D -->|abstract syntax tree| E(fa:fa-tools
C code
generator) E -->|characters| F[C source file]

We will now describe the different steps shown on the figure above.

To follow along with the videos that follow, start to install the dependencies as stated here. Then download the code

git clone --recurse-submodules git@github.com:lascar-pacagi/MiniJava.git
cd MiniJava
git checkout v1.0
make

The master branch is the version with a garbage collector and the tag v1.0 is a version without a garbage collector. If you want to modify the version 1.0, you can create a new branch (from_v1.0 for example)

git checkout -b from_v1.0 v1.0

The code we use during the videos is given below.

Lexical Analysis

The first step is the lexical analysis. This phase will group characters into words. Those words are called tokens. After this step we get a more structured information where the keywords of MiniJava, the identifiers, the integer constants and the boolean constants have been recognized. This is in this phase that commentaries and whitespaces are suppressed.
For the program given above, the scanner, the program responsible of the lexical analysis phase, will produce the following stream of tokens. For example, we can see that the keyword CLASS has been recognized, as the integer constant INT_CONST 35.

CLASS
IDENT ‘Print42‘
LBRACE
PUBLIC
STATIC
VOID
MAIN
LPAREN
STRING
LBRACKET
RBRACKET
IDENT ‘a‘
RPAREN
LBRACE
SYSO
LPAREN
INT_CONST ‘35‘
PLUS
INT_CONST ‘2‘
TIMES
INT_CONST ‘3‘
PLUS
INT_CONST ‘1‘
RPAREN
SEMICOLON
RBRACE
RBRACE
EOF

Syntactic Analysis

The second step, the syntactic analysis, takes as input the stream of tokens and produces an abstract syntax tree which represents the structure of the program with a tree data structure.
For the program given above, the parser, the program responsible of the syntactic analysis phase, will produce the following tree. We can see the arithmetic expression with the operators precedence explicitly stated (expressions deeper in the tree have more priority).

program
  ├name Print42
  ├main_args a
  └main
     └ISyso
        └EBinOp OpAdd
          ├EBinOp OpAdd
          │  ├EConst (ConstInt 35)
          │  └EBinOp OpMul
          │    ├EConst (ConstInt 2)
          │    └EConst (ConstInt 3)
          └EConst (ConstInt 1)

The syntactic analyser needs to know the structure of a MiniJava program. This structure is given in the form of a grammar. The MiniJava grammar, in its EBNF form is given below. Another version, certainly easier to read for us, using syntax diagrams is given here.1

Program = MainClass { ClassDeclaration } 'eof' ;

MainClass = 'class' Identifier '{' 'public' 'static' 'void' 'main' '(' 'String' '[' ']' Identifier ')' '{' Statement '}' '}' ;

ClassDeclaration = 'class' Identifier [ 'extends' Identifier ] '{' { VarDeclaration } { MethodDeclaration } '}' ;

VarDeclaration = Type Identifier ';' ;

MethodDeclaration = 'public' Type Identifier '(' [ Type Identifier { ',' Type Identifier } ] ')' '{' { VarDeclaration } { Statement } 'return' Expression ';' '}' ;

Type = 'int' '[' ']'
        | 'boolean'
        | 'int'
        |  Identifier ;

Statement = '{' { Statement } '}'
        | 'if' '(' Expression ')' Statement 'else' Statement
        | 'while' '(' Expression ')' Statement
        | 'System.out.println' '(' Expression ')' ';'
        | Identifier '=' Expression ';'
        | Identifier '[' Expression ']' '=' Expression ';' ;

Expression = Expression ( '&&' | '<' | '+' | '-' | '*' ) Expression
        | Expression '[' Expression ']'
        | Expression '.' 'length'
        | Expression '.' Identifier '(' [Expression {',' Expression}] ')'
        | Integer
        | 'true'
        | 'false'
        | Identifier
        | 'this'
        | 'new' 'int' '[' Expression ']'
        | 'new' Identifier '(' ')'
        | '!' Expression
        | '(' Expression ')' ;

Letter = 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G'
       | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N'
       | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U'
       | 'V' | 'W' | 'X' | 'Y' | 'Z' | 'a' | 'b'
       | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i'
       | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p'
       | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w'
       | 'x' | 'y' | 'z' ;

Digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

Integer = Digit { Digit } ;

Character = Letter | Digit | '_' ;

Identifier = Letter { Letter | Digit | '_' } ;

Typechecker

The third step, the typechecker, takes as input the abstract syntax tree and will check if the types are correct. For example, the typechecker will check if we call a method with the correct number of arguments, that the operators + and * are used with integer operands, that a class is compatible with another one using the class hierarchy, …

Code Generation

The last step is the code generator phase. We will again visit the abstract syntax tree to generate C code. For our running example, we get the following C source file.

#include <stdio.h>
int main(int argc, char *argv[]) {
  printf("%d\n", ((35 + (2 * 3)) + 1));
  return 0;
}

Semantic of a MiniJava Program

The semantic of a MiniJava program is given by its semantic as a Java program2. The main restrictions are

  • No class has Object as a superclass.
  • No super keyword.
  • Only a default constructor.
  • The only types available are
    • int.
    • boolean.
    • int[].
    • User defined classes.
  • No overloading.
  • All methods return a value. The return statement is at the end.
  • System.out.println() only prints integers.
  • No interfaces, no exceptions, no generic types, no lambdas.

We explain in the following video the main differences between Java and MiniJava.

We described briefly in the video the default initialization mechanism in Java (and MiniJava). We give further details below. You can look here for all the details.

Let’s consider the following Java program Init.java.

1class Init {
2    public static void main(String[] args) {
3        int i;
4        boolean b;
5        int[] a;
6        System.out.println(i + " " + b + " " + a);
7    }
8}

When we compile it, we get the following error messages.

$ javac Init.java
Init.java:6: error: variable i might not have been initialized
        System.out.println(i + " " + b + " " + a);
                           ^
Init.java:6: error: variable b might not have been initialized
        System.out.println(i + " " + b + " " + a);
                                     ^
Init.java:6: error: variable a might not have been initialized
        System.out.println(i + " " + b + " " + a);
                                               ^
3 errors

Indeed, the three variables i, b and a are put on the stack when main is called. Java does not initialize local variables, and their values are whatever was currently in those locations on the stack. The compiler rejects this program because of the arbitrary values that go into those variables, by telling us that those variables have not been initialized.

On the other hand, the following program is correct, because attributes have a default value.

 1class Init {
 2    public static void main(String[] args) {
 3        new Default().print();
 4    }
 5}
 6
 7class Default {
 8    private int i;
 9    private boolean b;
10    private int[] a;
11
12    public void print() {
13        System.out.println(i + " " + b + " " + a);
14    }
15}

And we get the following output.

$ javac Init.java
$ java Init
0 false null

Review of Java’s Dynamic Binding

We will review dynamic binding in Java in the following video, because the management of dynamic binding will be one of the most difficult point to handle in our transpiler.

Questions

Let’s have a look again at the MiniJava grammar. We would like to add to the language

  • The == operator.
  • Constructors.
  • private methods and constructors.
How to update the grammar to add those new elements?

In the previous question, what terminals did you add to the grammar (a terminal in the grammar will become a token for the scanner)?

Consider the following Java program.

 1class A {
 2    public int m1(int n) {
 3        System.out.println("int A:m1(int n)");
 4        return 0;
 5    }
 6}
 7
 8class B extends A {
 9    public boolean m1(int n) {
10        System.out.println("boolean B:m1(int n)");
11        return false;
12    }
13}
Does this code compile?

Consider the following Java program.

 1class A {
 2    public int m1(int n) {
 3        System.out.println("int A:m1(int n)");
 4        return 0;
 5    }
 6    public boolean m1(int n) {
 7        System.out.println("boolean A:m1(int n)");
 8        return false;
 9    }
10}
Does this code compile?

Consider the following Java program.

 1class A {
 2    public int m1(A a) {
 3        System.out.println("int A:m1(A a)");
 4        return 0;
 5    }
 6    public boolean m1(B b) {
 7        System.out.println("boolean A:m1(B b)");
 8        return false;
 9    }
10}
11
12class B extends A {
13}
Does this code compile?

Ressources


  1. The description of the grammar used to generate the diagram does not follow exactly the EBNF form. Details are given here. [return]
  2. See here for the semantic of the Java language. [return]