We will describe the MiniJava lexical analyzer, also called scanner, which will group the characters
of the source file into lexical units also called tokens. These lexical units will then be given to the syntactic analyzer in the following phase of the transpiler.
Before describing the lexical analyzer of MiniJava, we will present regular expressions that will be used
to describe tokens. We will also study automata, a kind of graph, and use them to match regular expressions.
Regular expressions will allow us to describe succinctly and quite intuitively
the MiniJava lexical units and will be used in the ocamllex
lexical analyzer generator that we
will be using in our transpiler.
A regular expression describes a set of words from a given vocabulary. We will take as an example
the vocabulary $\mathcal{V} = \{0, 1\}$ consisting simply of two elements: 0
and 1
. We describe below the basic elements and operators for creating regular expressions and the words that
they describe.
Basic regular expressions:
For $c \in \mathcal{V}$, the regular expression $\color{green}c$ represents the set containing a single word: $\{c\}$.
$\color{green}0$ | $\{0\}$ |
$\color{green}1$ | $\{1\}$ |
Compound regular expressions:
Parentheses can be used to group together regular expressions. Let $\color{green}{r}$ be a regular expression, then $\color{green}{(r)}$ represents the same set of words as the expression $\color{\green}{r}$.
$\color{green}{(0)}$ | $\{0\}$ |
The concatenation operator allows to juxtapose the words generated by two regular expressions. Let $\color{green}{r_1}$ and $\color{green}{r_2}$ be two regular expressions.
The concatenation of these two regular expressions is noted: $\color{green}{r_1r_2}$. The set of words described by this regular expression is
the concatenation of the words described by $\color{green}{r_1}$ with those described by $\color{green}{r_2}$.
Note that this operator is associative, which means that for any
regular expression $\color{green}{r_1}$, $\color{green}{r_2}$ and $\color{green}{r_3}$, we have
$\color{green}{(r_1r_2)r_3} = \color{green}{r_1(r_2r_3)}$ that we will simply write down as $\color{green}{r_1r_2r_3}$.
$\color{green}{\epsilon1}$ | $\{1\}$ |
$\color{green}{10}$ | $\{10\}$ |
$\color{green}{(10)1}$ | $\{101\}$ |
$\color{green}{1(01)}$ | $\{101\}$ |
$\color{green}{101}$ | $\{101\}$ |
The union operator is used to get the union of the words generated by two regular expressions. Let $\color{green}{r_1}$ and $\color{green}{r_2}$ be two regular expressions.
The union of these two regular expressions is noted: $\color{green}{r_1\ |\ r_2}$. The set of words described by this regular expression is
the union of the words described by $\color{green}{r_1}$ with those described by $\color{green}{r_2}$.
Note that this operator is commutative, that is
$\color{green}{r_1\ |\ r_2} = \color{green}{r_2\ |\ r_1}$. It’s also associative, meaning that for every
regular expression $\color{green}{r_1}$, $\color{green}{r_2}$ and $\color{green}{r_3}$, we have
$\color{green}{(r_1\ |\ r_2)\ |\ r_3} = \color{green}{r_1\ |\ (r_2\ |\ r_3)}$ that we will simply write down as $\color{green}{r_1\ |\ r_2\ |\ r_3}$.
$\color{green}{\epsilon \ | \ 1}$ | $\{\epsilon, 1\}$ |
$\color{green}{(00) \ | \ (10)}$ | $\{00, 10\}$ |
$\color{green}{(10) \ | \ (00)}$ | $\{00, 10\}$ |
$\color{green}{(0 \ | \ 1)\ |\ (10)}$ | $\{0, 1, 10\}$ |
$\color{green}{0\ |\ (1 \ | \ (10))}$ | $\{0, 1, 10\}$ |
$\color{green}{0 \ | \ 1\ |\ (10)}$ | $\{0, 1, 10\}$ |
$\color{green}{(0\ |\ 1)(0\ |\ 1)}$ | $\{00, 01, 10, 11\}$ |
The iteration operator noted *
allows to juxtapose $0$ or several times the words generated by a regular expression. Let $\color{green}{r}$ be a regular expression, then
the regular expression $\color{green}{r^*}$ represents the hypothetical2 regular expression
$\color{\green}{\epsilon\ |\ rr\ |\ rrr\ |\ rrrr\ |\ \cdots}$.
$\color{green}{0^*}$ | $\{\epsilon, 0, 00, 000, 0000, \cdots\}$ |
$\color{green}{(0\ | \ 1)^*}$ | $\{\epsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, \cdots\}$ |
To avoid too many parentheses, there is a priority between the different operators: the parentheses have the highest priority, then the $\color{green}{*}$ operator, then the concatenation operator and finally the $\color{green}{|}$ operator. We have also seen above that the concatenation and union operators are associative, which allows us to remove more brackets. Thus, the regular expression $\color{green}{10^*1\ |\ 11\ |\ \epsilon}$ reads $\color{green}{(((1(0^ *))1)\ |\ (11))\ |\ \epsilon}$.
We give below some examples of regular expressions still on the vocabulary $\mathcal{V} = \{0, 1\}$.
Binary numbers (without leading zeros) | $\color{green}{0\ | \ 1(0\ | \ 1)^*}$ | $\{0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010, 1011, 1100, \cdots\}$ |
Odd binary numbers | $\color{green}{1\ | \ 1(0\ | \ 1)^*1}$ | $\{1, 11, 101, 111, 1001, 1011, 1101, 1111, 10001, \cdots\}$ |
Bit strings of even length containing alternating zeros and ones | $\color{green}{(10)^* \ | \ (01)^*}$ | $\{\epsilon, 10, 01, 1010, 0101, 101010, 010101, \cdots\}$ |
Bit strings whose length is a multiple of 3 | $\color{green}{((0\ | \ 1)(0\ | \ 1)(0\ | \ 1))^*}$ | $\{\epsilon, 000, 001, 010, 011, 100, \cdots, 111000, 111001, \cdots, 101011110, \cdots \}$ |
Bit strings that do not contain the sub-string $11$ | $\color{green}{0^* ( 100^* )^* (1\ | \ \epsilon)}$ | $\{\epsilon, 0, 1, 00, 01, 10, 000, 001, 010, 100, 101, 0000, 0001, \cdots\}$ |
In the following video, we will formally define regular expressions and the languages they generate. We will also see how they are defined in
ocamllex
.
The following video will give examples of regular expressions found in MiniJava and show some extensions of regular expressions.
We will use the notation $\{a,b\}^*$ in the questions below: $\{a,b\}^*$ represents the language generated by the regular expression $\color{\green}{(a\ |\ b)^*}$.
We want to move in the grid below using the two actions: “go right” and “go up”. We start from the lower left corner and we want to get to the top right corner. A possible path is shown in the figure on the right.
In the previous section, we presented the regular expressions used to describe the so-called regular languages. This notation is useful to describe regular languages, and we will use it to describe lexical units in the MiniJava lexical analyzer. On the other hand, for recognition, i.e. to know whether a given word belongs to the language described by a regular expression, it is not easy to use a regular expression directly.
We will now describe finite, non-deterministic and deterministic automata, which make it easier to answer the question of whether a given word belongs to a given regular language. We will use these automata in the following section to build a software to test effectively, if a given word belongs to the language generated by a given regular expression.
Note that the languages described by finite automata (non-deterministic or deterministic) are regular languages, regular expressions and the automata are therefore two equivalent ways of describing the same languages.
The following figure represents a non-deterministic finite automaton, which we will call $A_{ndf}$3, which describes C comments of type /*...*/
.
To simplify, we assume that our vocabulary is $\mathcal{V} = \{ a, b, /, * \}$.
In this figure we can see the following:
states, the circles on the figure, numbered from 0
to 7
for this example.
You can see the starting state (or initial state), the 0
state, which has an arrow coming at it,
but which doesn’t start from any state. The 7
state is an accepting state (or final state), it is represented by a double circle.
transitions between states, the arrows on the figure. On the arrows, there are symbols
belonging to the vocabulary $\mathcal{V}$ or the symbol $\epsilon$. Note that on some transitions, for example the transition between state 3
and
state 2
, we put several symbols on the transition (on this transition there are the two symbols a
and b
).
Formally, we should have written two transitions instead of one, each with one of the two symbols, but doing as we did allows us to write the automaton more succinctly.
The automaton will allow us to know if a word m
constructed from the vocabulary $\mathcal{V}$ belongs to the language described by the automaton (we note this language $\mathcal{L}(A_{ndf})$).
Let’s consider the word /*/*/
, which we will call m
, that belongs to $\mathcal{V}^*$.
How do we know if this word is described by the $A_{ndf}$ automaton?
We will start from the initial state, the state 0
, and we will follow the transitions,
character after character, looking for a path that leads us to the accepting state 7
after reading all the characters of the word m
.
0
, there is only one transition, so there is no choice. The word must therefore begin with /
, because that is the symbol on that transition.
Once this transition is passed, we are in state 1
and we still have to analyze the */*/
part of m
.1
, there is also only one possible transition. We must therefore have the symbol *
in the remaining part */*/
of m
,
because it is the symbol on the only transition from state 1
.
Once this transition is done, we are in state 2
and we still have to analyze the /*/
part of m
.2
has three outgoing transitions. All three are labeled with the $\epsilon$ symbol. This symbol means that the input is not modified when
you go through such a transition. We can see now why the automaton is non-deterministic because on the same symbol, here $\epsilon$, we have the choice between several transitions.
How do we choose the transition to follow? We are going to assume for the moment that we have the skill of clairvoyance and that we are going
to choose the right transition, which is the one towards state 4
. We will see in the videos how to automate this.4
we still have the choice between two transitions: we can choose not to consume a character in the input by taking the $\epsilon$ transition, or to consume the /
character
by looping back to state 4
. As we are assuming that we can guess correctly, we will loop back to state 4
and consume the /
.*/
. We will take the $\epsilon$ transition to state 2
, then the $\epsilon$ transition from state 2
to state 5
. Once again,
we do not worry for the moment about how to make the right transition choices when there is more than one possibility. We now find ourselves in the configuration below, where the cursor under the input string
has not moved.5
, we only have an outgoing transition on the *
character. The cursor on the input is placed on the *
, so we can
take this transition and move the cursor to the right. Now we just have to recognize the /
.6
, there is also only one transition on the /
symbol. Since the cursor on the input points to a /
character, we can take this transition and move to
the accepting state 7
./*/*/
belongs to the $\mathcal{L}(A_{ndf})$ language.
The word /*/*/
is therefore a comment.The input string /*/*/
is accepted by our automaton, but how do we know if a string is not in the $\mathcal{L}(A_{ndf})$ language, that is to say
how do you know if the word is not accepted?
For a non-deterministic automaton, it must be shown that after reading all the characters of the input string, one cannot be in an accepting state.
It seems easier to construct the non-deterministic finite automaton we have just seen to describe the language of comments than the regular expression $/*\color{darkgreen}{(}*^{\color{darkgreen}{+}}\color{darkgreen}{(}a\ \color{darkgreen}{|}\ b\color{darkgreen}{)}\ \color{darkgreen}{|}\ \color{darkgreen}{(}a\ \color{darkgreen}{|}\ b\ \color{darkgreen}{|}\ /\color{darkgreen}{)}\color{darkgreen}{)}^{\color{darkgreen}{*}}*^{\color{darkgreen}{+}}/$ that we saw in the previous section. But you may be a Perl1 guru and it’s just too easy for you .
Perl
stands for Practical Extraction and Report Language, or Pathologically Eclectic Rubbish Lister . [return]In the two following videos, we will detail non-deterministic finite automata, and how to detect if a word belongs or not to the language generated by a non-deterministic finite automaton.
In the following video, we will show how to go from a regular expression to a non-deterministic finite automaton.
Deterministic finite automata are a subset of the non-deterministic finite automata. The advantage of these automata is that they no longer need to “guess” the right transition to follow because, in a given state and for a given symbol of the input, there is at most one possible transition. As we have seen in the previous section, we can actually use a non-deterministic automaton without the need to guess. The algorithm we have seen actually allows us to dynamically construct a deterministic finite automaton. The advantage of starting directly from a deterministic automaton is that one does not need to reconstruct it each time. This will be all the more interesting for a lexical analyser because the regular expressions used to describe the lexical units will not change and we will gain efficiency by building once and for all the automata corresponding to the regular expressions.
The following automaton is a
deterministic version of the non-deterministic automaton that recognizes the multiline comments in C
in the previous section.
Since deterministic finite automata are a restriction of non-deterministic finite automata, one could rightly believe that they allow us to describe fewer languages. In fact this is not the case and they are as powerful as non-deterministic finite automata.
The video below will describe deterministic finite automata and show how to transform a non-deterministic automaton into a deterministic one.
The code used in the previous video is available here.
In the following video, we will show how a lexical analyzer works and how to obtain a deterministic finite automaton of minimal size.
In the following videos, we will code in C++ a lexical analyzer for the small language fragment described in the previous video. The code used in these videos is available here.
We can automatically construct the regular expression corresponding to a finite automaton (deterministic or non-deterministic). We show below a series of transformations to go from the deterministic finite automaton corresponding to the comments in C seen above, to an equivalent regular expression. We will detail this transformation in the video below.
We can see, on the transitions, regular expressions appearing as the transformation progress. In order not to confuse the *
character with the operator
*, we write the operator in green.
First of all, we will rewrite the automaton by making clear the regular expressions representing the alternatives on the transitions.
We are now going to eliminate one state at a time to arrive at an automaton containing only two states: the initial state and the accepting state.
To eliminate the state $q = \{3,5,6\}$, for each pair of states $(q_1, q_2)$, if there is an arc between $q_1$ and $q$ and between $q$ and $q_2$, we need to keep this information by modifying the arc between $q_1$ and $q_2$ accordingly.
For example, here, we will have to consider the path $\{2,3,4,5\} \rightarrow q \rightarrow \{7\}$ and add the regular expression $**^{\color{darkgreen}{*}}/$ between the states $\{2,3,4,5\}$ and $\{7\}$ to keep the same information. We also need to consider the $\{2,3,4,5\} \rightarrow q \rightarrow \{2,3,4,5\}$ path and add the regular expression $**^{\color{darkgreen}{*}}{\color{darkgreen}{(}}a\mbox{ }{\color{darkgreen}{|}}\mbox{ }b{\color{darkgreen}{)}}$ on the loop on state $\{2,3,4,5\}$. We then get the following automaton.
By eliminating the state $\{2,3,4,5\}$ we get the next automaton.
And finally, by eliminating state $\{1\}$, we get the final regular expression on the arc connecting the initial state to the accepting state.
The following video will detail this construction.
The next videos will detail a program in OCaml to transform an automaton into a regular expression using the Floyd-Warshall algorithm.
The following video presents the Floyd-Warshall transitive closure algorithm on a graph to present the concepts more simply before moving on to the next video to the automatic creation of a regular expression from an automaton. The code presented in the video can be found here.
The following video describes the code that can transform an automaton into a regular expression. The code can be found here, and the little python script that transforms our representation into the expected one on this site can be found here.
We are going to put into practice the notions we have just seen about regular expressions and automata by making a small application to test whether or not a string matches a pattern represented by a regular expression.
We will present a sequence of interactions in the OCaml interpreter that we will be able to do with this application.
1utop # let re = RE.regex_from_string "0*(100*)*1?";;
2val re : RE.regex =
3 RE.Concatenation (RE.ZeroOrMore (RE.CharSet <abstr>),
4 RE.Concatenation
5 (RE.ZeroOrMore
6 (RE.Concatenation (RE.CharSet <abstr>,
7 RE.Concatenation (RE.CharSet <abstr>,
8 RE.ZeroOrMore (RE.CharSet <abstr>)))),
9 RE.ZeroOrOne (RE.CharSet <abstr>)))
On line 1, we create the regular expression $\color{darkgreen}{0^*(100^*)^*(1|\epsilon)}$ which allows us to describe the words that do not contain consecutive 1
.
Note that we use the notation 1?
to represent $\color{darkgreen}{(1|\epsilon)}$.
We then create an equivalent non-deterministic finite automaton.
utop # let nfa = NFA.init re;;
val nfa : NFA.t = <abstr>
We can then test if a string, here 101010
, belongs or not to the language generated by the non-deterministic finite automaton and thus by the regular expression.
utop # NFA.full_match nfa "101010";;
- : bool = true
We can see in the following example, that the string 011111100100
, containing consecutive 1
, does not match the regular expression.
utop # NFA.full_match nfa "011111100100";;
- : bool = false
We can test if a substring is in a given string by surrounding an expression with .*
. The .
represents any character.
The following example will define a regular expression to search for the substring Doc
.
utop # let re = RE.regex_from_string ".*Doc.*";;
val re : RE.regex =
RE.Concatenation (RE.ZeroOrMore (RE.CharSet <abstr>),
RE.Concatenation (RE.CharSet <abstr>,
RE.Concatenation (RE.CharSet <abstr>,
RE.Concatenation (RE.CharSet <abstr>, RE.ZeroOrMore (RE.CharSet <abstr>)))))
utop # let dfa = DFA.init re;;
val dfa : DFA.t = <abstr>
utop # DFA.full_match dfa "Wait a minute, Doc. Ah... Are you telling me that you built a time machine... out of a DeLorean?";;
- : bool = true
utop # DFA.full_match dfa "The way I see it, if you're gonna build a time machine into a car, why not do it with some style?";;
- : bool = false
The code that will be explained in the following videos can be found here.
In the following video, we will present an overview of the application and detail the code to go from a string representing a regular expression to an OCaml representation of this regular expression. The grammar describing regular expressions can be found here.
In the following video, we present the notion of programming with continuation that we will use in conjonction with backtracking in the pattern recognition module. The code to illustrate continuations can be found here.
There is the solution to one of the questions we ask in the video at the end of the listing on continuations.
In the following video, we will describe the pattern recognition module based on backtracking and continuations.
In the following video, we will describe the pattern recognition module based on non-deterministic finite automata.
In the following video, we will describe the pattern recognition module based on deterministic finite automata.
In the following video, we will show how we have tested our different modules.
The code below describes the part of the regex_from_string
function that handles concatenations.
1and re1 l =
2 let e, l = re2 l in
3 let e, l =
4 let rec re1' e l =
5 match l with
6 | '?' :: r -> re1' (ZeroOrOne e) r
7 | '*' :: r -> re1' (ZeroOrMore e) r
8 | '+' :: r -> re1' (OneOrMore e) r
9 | _ -> e, l
10 in
11 re1' e l
12 in
13 match l with
14 | c :: _ when c <> ')' && c <> '|' ->
15 let e', l = re1 l in
16 Concatenation (e, e'), l
17 | _ ->
18 e, l
In our DFA
module, we memoized the transitions already seen thanks to the Memo
module whose definition is given below.
1module Memo =
2 Map.Make(
3 struct
4 type t = S.t * char
5 let compare (s1, c1) (s2, c2) =
6 let res = compare c1 c2 in
7 if res = 0 then
8 S.compare s1 s2
9 else
10 res
11 end)
The key comparison function in this table, defined from line 5
to line 10
, may require to compare sets on line 8
.
When one is in a given state of the deterministic finite automaton, it is sufficient to look whether the transition on a particular character has already
been seen. There is therefore no need to compare sets and to have for each transition from the same state,
a key that contains a state and a character.
To avoid creating a table that requires a pair of state and character as a key, one should associate to each state of the deterministic finite automaton (which is
a set of states of the non-deterministic finite automaton) a table.
The keys of this table will be characters, and the table will allow to store the transitions already encountered.
The new module that we want to realize is the following.
module DFA2 : Matching = struct
(* TO DO *)
end
We will now describe the MiniJava lexical analyzer (or scanner) which is created using ocamllex.
The tool ocamllex
is a lexical analyzer generator. You give it a list of regular expressions with actions to perform when a regular expression is recognized.
The tool will then automatically generate a lexical analyzer that looks roughly like the lexer.cpp program
that have we studied above.
The following program shows a MiniJava program, Lexical.java
, which is invalid, but is nevertheless lexically correct.
class
/*/*/
public 123MrC00der;
while )(
{ int
int42
[]
// this sentence is false
If we execute the command ./mini-java --show-tokens-with-loc Lexical.java
to run our mini-java
transpiler with the option to output only the tokens produced by
the lexical analyzer, we obtain the following tokens4.
CLASS
PUBLIC
INT_CONST ‘123‘
IDENT ‘MrC00der‘ ▸ line 3, char 11 ◂
SEMICOLON
WHILE
RPAREN
LPAREN
LBRACE
INTEGER
IDENT ‘int42‘ ▸ line 6, char 1 ◂
LBRACKET
RBRACKET
EOF
The tokens will be used by the syntax analyzer that we will study in the next chapter.
The following video will introduce ocamllex
and describe the lexical analyzer of our transpiler. The code of the calculator in Reverse Polish Notation can be found
here.
We want to write a program, using ocamllex
, that replaces tabulations by four spaces and removes spaces
and tabulations just before the end of line. For example, suppose we have a file.txt
file whose content is shown below using
the unix cat
command to display tabulations, represented by ^I
, and line breaks represented by $
.
cat -ET file.txt
I wish you^I ^I$
a very happy new^I^I year^I $
^I$
$
If the ocamllex
file is called clean.mll
, it is compiled as shown below.
ocamllex clean.mll
ocamlopt clean.ml -o clean
We will then use the clean
program on a file.txt
file, for example, as follows.
./clean < fichier.txt > res.txt
We will then get in the res.txt
file, the contents of the file.txt
file where the tabulations have been transformed into four spaces, and where
the spaces and tabulations right before the end of line have been removed.
cat -ET res.txt
I wish you$
a very happy new year$
$
$
Play with regular expressions
Test regular expressions
Generate test-cases for regular expressions
Transform regular expressions into automata
Russ Cox on regular expression matching
Learn OCaml
Try OCaml
Functional programming course using OCaml
OCaml documentation
Ocamllex documentation
Ocamllex in Real World OCaml
ISO C++
C++ core guidelines
C++ standard