MrCoder > Compiler > MiniJava > Lexical Analysis

Lexical Analysis

We will describe the MiniJava lexical analyzer, also called scanner, which will group the characters of the source file into lexical units also called tokens. These lexical units will then be given to the syntactic analyzer in the following phase of the transpiler.
Before describing the lexical analyzer of MiniJava, we will present regular expressions that will be used to describe tokens. We will also study automata, a kind of graph, and use them to match regular expressions.

Regular Expressions

Regular expressions will allow us to describe succinctly and quite intuitively the MiniJava lexical units and will be used in the ocamllex lexical analyzer generator that we will be using in our transpiler.

A regular expression describes a set of words from a given vocabulary. We will take as an example the vocabulary $\mathcal{V} = \{0, 1\}$ consisting simply of two elements: 0 and 1. We describe below the basic elements and operators for creating regular expressions and the words that they describe.

Basic regular expressions:
- The regular expression $\color{green}\epsilon$ generates the set containing only the empty word¹: $\{\epsilon\}$.
- For $c \in \mathcal{V}$, the regular expression $\color{green}c$ represents the set containing a single word: $\{c\}$.
  
  Expression Set of words
  
  $\color{green}0$ $\{0\}$
  
  $\color{green}1$ $\{1\}$

Expression	Set of words
$\color{green}0$	$\{0\}$
$\color{green}1$	$\{1\}$

Compound regular expressions:

Parentheses can be used to group together regular expressions. Let $\color{green}{r}$ be a regular expression, then $\color{green}{(r)}$ represents the same set of words as the expression $\color{\green}{r}$.

Expression Set of words

$\color{green}{(0)}$ $\{0\}$

Expression	Set of words
$\color{green}{(0)}$	$\{0\}$

The concatenation operator allows to juxtapose the words generated by two regular expressions. Let $\color{green}{r_1}$ and $\color{green}{r_2}$ be two regular expressions. The concatenation of these two regular expressions is noted: $\color{green}{r_1r_2}$. The set of words described by this regular expression is the concatenation of the words described by $\color{green}{r_1}$ with those described by $\color{green}{r_2}$.
Note that this operator is associative, which means that for any regular expression $\color{green}{r_1}$, $\color{green}{r_2}$ and $\color{green}{r_3}$, we have $\color{green}{(r_1r_2)r_3} = \color{green}{r_1(r_2r_3)}$ that we will simply write down as $\color{green}{r_1r_2r_3}$.

Expression	Set of words
$\color{green}{\epsilon1}$	$\{1\}$
$\color{green}{10}$	$\{10\}$
$\color{green}{(10)1}$	$\{101\}$
$\color{green}{1(01)}$	$\{101\}$
$\color{green}{101}$	$\{101\}$

The union operator is used to get the union of the words generated by two regular expressions. Let $\color{green}{r_1}$ and $\color{green}{r_2}$ be two regular expressions. The union of these two regular expressions is noted: $\color{green}{r_1\ |\ r_2}$. The set of words described by this regular expression is the union of the words described by $\color{green}{r_1}$ with those described by $\color{green}{r_2}$.
Note that this operator is commutative, that is $\color{green}{r_1\ |\ r_2} = \color{green}{r_2\ |\ r_1}$. It’s also associative, meaning that for every regular expression $\color{green}{r_1}$, $\color{green}{r_2}$ and $\color{green}{r_3}$, we have $\color{green}{(r_1\ |\ r_2)\ |\ r_3} = \color{green}{r_1\ |\ (r_2\ |\ r_3)}$ that we will simply write down as $\color{green}{r_1\ |\ r_2\ |\ r_3}$.

Expression	Set of words
$\color{green}{\epsilon \ \| \ 1}$	$\{\epsilon, 1\}$
$\color{green}{(00) \ \| \ (10)}$	$\{00, 10\}$
$\color{green}{(10) \ \| \ (00)}$	$\{00, 10\}$
$\color{green}{(0 \ \| \ 1)\ \|\ (10)}$	$\{0, 1, 10\}$
$\color{green}{0\ \|\ (1 \ \| \ (10))}$	$\{0, 1, 10\}$
$\color{green}{0 \ \| \ 1\ \|\ (10)}$	$\{0, 1, 10\}$
$\color{green}{(0\ \|\ 1)(0\ \|\ 1)}$	$\{00, 01, 10, 11\}$

The iteration operator noted * allows to juxtapose $0$ or several times the words generated by a regular expression. Let $\color{green}{r}$ be a regular expression, then the regular expression $\color{green}{r^*}$ represents the hypothetical² regular expression $\color{\green}{\epsilon\ |\ rr\ |\ rrr\ |\ rrrr\ |\ \cdots}$.

Expression Set of words

$\color{green}{0^*}$ $\{\epsilon, 0, 00, 000, 0000, \cdots\}$

$\color{green}{(0\ | \ 1)^*}$ $\{\epsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, \cdots\}$

Expression	Set of words
$\color{green}{0^*}$	$\{\epsilon, 0, 00, 000, 0000, \cdots\}$
$\color{green}{(0\ \| \ 1)^*}$	$\{\epsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, \cdots\}$

To avoid too many parentheses, there is a priority between the different operators: the parentheses have the highest priority, then the $\color{green}{*}$ operator, then the concatenation operator and finally the $\color{green}{|}$ operator. We have also seen above that the concatenation and union operators are associative, which allows us to remove more brackets. Thus, the regular expression $\color{green}{10^*1\ |\ 11\ |\ \epsilon}$ reads $\color{green}{(((1(0^ *))1)\ |\ (11))\ |\ \epsilon}$.

Examples

We give below some examples of regular expressions still on the vocabulary $\mathcal{V} = \{0, 1\}$.

Description	Expression	Set of words
Binary numbers (without leading zeros)	$\color{green}{0\ \| \ 1(0\ \| \ 1)^*}$	$\{0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010, 1011, 1100, \cdots\}$
Odd binary numbers	$\color{green}{1\ \| \ 1(0\ \| \ 1)^*1}$	$\{1, 11, 101, 111, 1001, 1011, 1101, 1111, 10001, \cdots\}$
Bit strings of even length containing alternating zeros and ones	$\color{green}{(10)^* \ \| \ (01)^*}$	$\{\epsilon, 10, 01, 1010, 0101, 101010, 010101, \cdots\}$
Bit strings whose length is a multiple of 3	$\color{green}{((0\ \| \ 1)(0\ \| \ 1)(0\ \| \ 1))^*}$	$\{\epsilon, 000, 001, 010, 011, 100, \cdots, 111000, 111001, \cdots, 101011110, \cdots \}$
Bit strings that do not contain the sub-string $11$	$\color{green}{0^* ( 100^* )^* (1\ \| \ \epsilon)}$	$\{\epsilon, 0, 1, 00, 01, 10, 000, 001, 010, 100, 101, 0000, 0001, \cdots\}$

You can have fun using regular expressions on Regex Crossword .

In the following video, we will formally define regular expressions and the languages they generate. We will also see how they are defined in ocamllex.

The following video will give examples of regular expressions found in MiniJava and show some extensions of regular expressions.

Questions

We will use the notation $\{a,b\}^*$ in the questions below: $\{a,b\}^*$ represents the language generated by the regular expression $\color{\green}{(a\ |\ b)^*}$.

Let's consider the alphabet $\{a, b\}$. Give a regular expression to describe the language: $\{ w \in \{a, b\}^*\ |\ w$ contains the words $aa$ or $bb$ $\}$.

This question, although one might think that it is very similar to the previous one, is not so easy. You can come back to this question after studying the next section on automata. Let's consider the alphabet $\{a, b\}$. Give a regular expression to describe the language: $\{ w \in \{a, b\}^*\ |\ w$ does not contain the words $aa$ or $bb\}$.

We want to move in the grid below using the two actions: “go right” and “go up”. We start from the lower left corner and we want to get to the top right corner. A possible path is shown in the figure on the right.

Write a regular expression to describe all the actions that lead from the lower left corner to the upper right corner.

This question is not too easy. You can come back to this question after studying the next section on automata. Let's consider the alphabet $\{a, b\}$. Give a regular expression to describe the language: $\{ w \in \{a, b\}^*\ |\ w$ contains an even number of $a\}$.

Let's consider the alphabet $\{0, 1\}$. What is the language described by the following regular expression: $\color{green}{0^*10^*10^*(10^*\ |\ \epsilon)}$?

Automata

In the previous section, we presented the regular expressions used to describe the so-called regular languages. This notation is useful to describe regular languages, and we will use it to describe lexical units in the MiniJava lexical analyzer. On the other hand, for recognition, i.e. to know whether a given word belongs to the language described by a regular expression, it is not easy to use a regular expression directly.

We will now describe finite, non-deterministic and deterministic automata, which make it easier to answer the question of whether a given word belongs to a given regular language. We will use these automata in the following section to build a software to test effectively, if a given word belongs to the language generated by a given regular expression.

Note that the languages described by finite automata (non-deterministic or deterministic) are regular languages, regular expressions and the automata are therefore two equivalent ways of describing the same languages.

Non-deterministic Finite Automata

The following figure represents a non-deterministic finite automaton, which we will call $A_{ndf}$³, which describes C comments of type /*...*/. To simplify, we assume that our vocabulary is $\mathcal{V} = \{ a, b, /, * \}$. In this figure we can see the following:

states, the circles on the figure, numbered from 0 to 7 for this example. You can see the starting state (or initial state), the 0 state, which has an arrow coming at it, but which doesn’t start from any state. The 7 state is an accepting state (or final state), it is represented by a double circle.
transitions between states, the arrows on the figure. On the arrows, there are symbols belonging to the vocabulary $\mathcal{V}$ or the symbol $\epsilon$. Note that on some transitions, for example the transition between state 3 and state 2, we put several symbols on the transition (on this transition there are the two symbols a and b). Formally, we should have written two transitions instead of one, each with one of the two symbols, but doing as we did allows us to write the automaton more succinctly.

The automaton will allow us to know if a word m constructed from the vocabulary $\mathcal{V}$ belongs to the language described by the automaton (we note this language $\mathcal{L}(A_{ndf})$).

Let’s consider the word /*/*/, which we will call m, that belongs to $\mathcal{V}^*$. How do we know if this word is described by the $A_{ndf}$ automaton?

We will start from the initial state, the state 0, and we will follow the transitions, character after character, looking for a path that leads us to the accepting state 7 after reading all the characters of the word m.

Starting from state 0, there is only one transition, so there is no choice. The word must therefore begin with /, because that is the symbol on that transition. Once this transition is passed, we are in state 1 and we still have to analyze the */*/ part of m.

Starting from state 1, there is also only one possible transition. We must therefore have the symbol * in the remaining part */*/ of m, because it is the symbol on the only transition from state 1. Once this transition is done, we are in state 2 and we still have to analyze the /*/ part of m.

The state 2 has three outgoing transitions. All three are labeled with the $\epsilon$ symbol. This symbol means that the input is not modified when you go through such a transition. We can see now why the automaton is non-deterministic because on the same symbol, here $\epsilon$, we have the choice between several transitions. How do we choose the transition to follow? We are going to assume for the moment that we have the skill of clairvoyance and that we are going to choose the right transition, which is the one towards state 4. We will see in the videos how to automate this.

In state 4 we still have the choice between two transitions: we can choose not to consume a character in the input by taking the $\epsilon$ transition, or to consume the / character by looping back to state 4. As we are assuming that we can guess correctly, we will loop back to state 4 and consume the /.

Now, the entry we have left to consume is */. We will take the $\epsilon$ transition to state 2, then the $\epsilon$ transition from state 2 to state 5. Once again, we do not worry for the moment about how to make the right transition choices when there is more than one possibility. We now find ourselves in the configuration below, where the cursor under the input string has not moved.

In state 5, we only have an outgoing transition on the * character. The cursor on the input is placed on the *, so we can take this transition and move the cursor to the right. Now we just have to recognize the /.

In state 6, there is also only one transition on the / symbol. Since the cursor on the input points to a / character, we can take this transition and move to the accepting state 7.

Since the input string is now empty and we are in an accepting state, we can conclude that the word /*/*/ belongs to the $\mathcal{L}(A_{ndf})$ language. The word /*/*/ is therefore a comment.

The input string /*/*/ is accepted by our automaton, but how do we know if a string is not in the $\mathcal{L}(A_{ndf})$ language, that is to say how do you know if the word is not accepted? For a non-deterministic automaton, it must be shown that after reading all the characters of the input string, one cannot be in an accepting state.

It seems easier to construct the non-deterministic finite automaton we have just seen to describe the language of comments than the regular expression $/*\color{darkgreen}{(}*^{\color{darkgreen}{+}}\color{darkgreen}{(}a\ \color{darkgreen}{|}\ b\color{darkgreen}{)}\ \color{darkgreen}{|}\ \color{darkgreen}{(}a\ \color{darkgreen}{|}\ b\ \color{darkgreen}{|}\ /\color{darkgreen}{)}\color{darkgreen}{)}^{\color{darkgreen}{*}}*^{\color{darkgreen}{+}}/$ that we saw in the previous section. But you may be a Perl¹ guru and it’s just too easy for you .

Perl stands for Practical Extraction and Report Language, or Pathologically Eclectic Rubbish Lister . ^[return]

In the two following videos, we will detail non-deterministic finite automata, and how to detect if a word belongs or not to the language generated by a non-deterministic finite automaton.

In the following video, we will show how to go from a regular expression to a non-deterministic finite automaton.

Questions

Let's consider the alphabet $\{a, b\}$. Build an automaton that recognizes the following language: $\{ w \in \{a, b\}^*\ |\ w$ contains the word $aba\}$. For example, $aba$ is in the language, as well as $bbbbbaabaaaabb$, but not $babbbaaa$.

Let's consider the alphabet $\{a, b\}$. Construct an automaton that recognizes the language: $\{ w \in \{a, b\}^*\ |\ w$ does not contain the word $aba$ unless it is preceded by the word $bbb\}$. For example, $aaabbbaabaa$ is in the language, $abba$ is in the language, but not $bbabab$.

The upper part of the automaton, the states 1, 2, 3 and 4, is used to recognize a sequence of three b followed by anything. The lower part of the automaton is used to recognize anything but aba. The state 5 indicates that you haven’t yet seen an a or that you’ve just encountered a sequence ending in bb (so you’re sure you haven’t seen a sequence ending in ab). The state 6 indicates that we are analyzing a sequence of at least one a and state 7 that we just saw ab, so we must not have an a now. From states 5, 6 and 7 we can reach the upper part of the automaton because we have just analyzed a correct prefix and we may want to add aba in the following (adding bbb before).

How can we be sure that the lower part recognizes everything but aba? For the upper part, it is quite easy to convince yourself that it recognizes $bbb\color{darkgreen}{(}a\ \color{darkgreen}{|}\ b\color{darkgreen}{)}^{\color{darkgreen}{*}}$. But it’s not so easy to convince yourself that the bottom part describes everything but the aba chain. When you really want to be sure, there is only one way, you have to make a proof! Sure, You can tell me that it still has to be correct. You are right , but in a proof you just have to convince yourself that every elementary step is correct.

We will do a proof by induction on the length of the input string. For a string of length 0 ($\epsilon$) of length 1 ($a$ and $b$) and of length 2 ($aa$, $ab$, $ba$ and $bb$), we can follow the transitions from state 0 to the lower part and see that they are all recognized and they do not have aba in them (because the string length is less than or equal to 2). Suppose that the property holds for strings of length $n \ge 2$, is it still true for strings of length $n + 1$? Let’s look at the last two characters of the string m of length $n$.

m ends with aa. In this case we must be in state 6 as shown above. We can add an a and accept the new ma string because we stay in state 6 which is an accepting state, and we can also add a b and accept the mb string because we are in state 7 which is also an accepting state.
m ends with ab. In this case we must be in state 7 as indicated above. You cannot add an a because there is no transition on the character a from state 7 and therefore a string containing aba will not be recognized. However, you can add a b and still recognize the string of length $n+1$ mb.
m ends with ba. In this case we must be in state 6 as indicated above and we can add an a or a b to get the strings of length $n+1$ ma or mb.
m ends with bb. In this case we must be in state 5 as above and we can add an a or b to get the strings of length $n+1$ ma or mb.

So assuming that we can generate all words of length $n$ not containing aba, we have just shown that we can generate all words of length $n+1$ not containing aba.

Deterministic Finite Automata

Deterministic finite automata are a subset of the non-deterministic finite automata. The advantage of these automata is that they no longer need to “guess” the right transition to follow because, in a given state and for a given symbol of the input, there is at most one possible transition. As we have seen in the previous section, we can actually use a non-deterministic automaton without the need to guess. The algorithm we have seen actually allows us to dynamically construct a deterministic finite automaton. The advantage of starting directly from a deterministic automaton is that one does not need to reconstruct it each time. This will be all the more interesting for a lexical analyser because the regular expressions used to describe the lexical units will not change and we will gain efficiency by building once and for all the automata corresponding to the regular expressions.

The following automaton is a deterministic version of the non-deterministic automaton that recognizes the multiline comments in C in the previous section.

Since deterministic finite automata are a restriction of non-deterministic finite automata, one could rightly believe that they allow us to describe fewer languages. In fact this is not the case and they are as powerful as non-deterministic finite automata.

The video below will describe deterministic finite automata and show how to transform a non-deterministic automaton into a deterministic one.

The code used in the previous video is available here.

In the following video, we will show how a lexical analyzer works and how to obtain a deterministic finite automaton of minimal size.

In the following videos, we will code in C++ a lexical analyzer for the small language fragment described in the previous video. The code used in these videos is available here.

Questions

Let's consider the alphabet $\{a, b\}$. Construct an automaton that recognizes the language: $\{ w \in \{a, b\}^*\ |\ w$ contains an odd number of $a$ and an even number of $b \}$. For example, $abb$ is in the language, as well as $bbabbaa$ and $aaaa$, but not $b$ nor $aabb$.

Let's consider the regular expression $\color{darkgreen}{0^*(100^*)^*(1|\epsilon)}$ that describes the bit strings on the alphabet $\{0, 1\}$ not containing the substring $11$. Transform this regular expression into a non-deterministic finite automaton, then transform it into a deterministic finite automaton and finally minimize it.

The transformation of the regular expression into a non-deterministic finite automaton gives the following automaton. Note that we have taken some liberties with the transformations that we had seen in the video to reduce the size of the automaton a little bit, but the resulting automaton is very similar to what we had seen.

The corresponding deterministic finite automaton (using the transformation we have seen) is given below.

In this automaton, for example, the state $0$ corresponds to the set of states $\{0,1,3,4,10,11\}$ of the non-deterministic finite automaton and the state $4$ corresponds to the set $\{8,7,9,4,10,11\}$.

Now we just need to minimize that automaton. First, we will make explicit the dead state that we are going to denote by D, which is implicit in the figure above representing the deterministic automaton. It is the state that is reached on a transition that is not indicated in the automaton of the above figure. If we make this state explicit, we get the following equivalent automaton:

We will first consider the two sets of states that we can immediately distinguish: the terminal states and non-terminal states. This gives us the following two groups.

$G_1 = \{0,1,2,3,4\}$
$G_2 = \{D\}$

For the $G_1$ group, the states $0,1,3$ and $4$ have a transition to one of the $G_1$ group states on a 0 or 1. On the other hand, state $2$ have a transition to the $G_2$ group on a 1. So the group $G_1$ will have to be split. The group $G_2$ has only one element, so it remains unchanged. We now get the next three groups.

$G_{1,1} = \{0,1,3,4\}$
$G_{1,2} = \{2\}$
$G_2 = \{D\}$

In the group $G_{1,1}$, we have

$0 \xrightarrow[]{0} G_{1,1}$
$1 \xrightarrow[]{0} G_{1,1}$
$3 \xrightarrow[]{0} G_{1,1}$
$4 \xrightarrow[]{0} G_{1,1}$
$0 \xrightarrow[]{1} G_{1,2}$
$1 \xrightarrow[]{1} G_{1,2}$
$3 \xrightarrow[]{1} G_{1,2}$
$4 \xrightarrow[]{1} G_{1,2}$

The $G_{1,1}$ state does not need to be split any further because the transitions on 0 and 1 make each of the states of $G_{1,1}$ reach the same group. There is no more groups that can be split, so we are done minimizing. The states $0$, $1$, $3$ and $4$ will so be grouped together in one state. The automaton obtained after minimization is given below (we do not show the dead state).

If we interpret this automaton, we can see that state $0$ indicates that we have just seen a zero, or that we have not yet read anything. As for state $2$, it indicates that we have just encountered a $1$.

Transformation from an Automaton to a Regular Expression

We can automatically construct the regular expression corresponding to a finite automaton (deterministic or non-deterministic). We show below a series of transformations to go from the deterministic finite automaton corresponding to the comments in C seen above, to an equivalent regular expression. We will detail this transformation in the video below.

We can see, on the transitions, regular expressions appearing as the transformation progress. In order not to confuse the * character with the operator *, we write the operator in green.

First of all, we will rewrite the automaton by making clear the regular expressions representing the alternatives on the transitions.

We are now going to eliminate one state at a time to arrive at an automaton containing only two states: the initial state and the accepting state.

To eliminate the state $q = \{3,5,6\}$, for each pair of states $(q_1, q_2)$, if there is an arc between $q_1$ and $q$ and between $q$ and $q_2$, we need to keep this information by modifying the arc between $q_1$ and $q_2$ accordingly.

For example, here, we will have to consider the path $\{2,3,4,5\} \rightarrow q \rightarrow \{7\}$ and add the regular expression $**^{\color{darkgreen}{*}}/$ between the states $\{2,3,4,5\}$ and $\{7\}$ to keep the same information. We also need to consider the $\{2,3,4,5\} \rightarrow q \rightarrow \{2,3,4,5\}$ path and add the regular expression $**^{\color{darkgreen}{*}}{\color{darkgreen}{(}}a\mbox{ }{\color{darkgreen}{|}}\mbox{ }b{\color{darkgreen}{)}}$ on the loop on state $\{2,3,4,5\}$. We then get the following automaton.

By eliminating the state $\{2,3,4,5\}$ we get the next automaton.

And finally, by eliminating state $\{1\}$, we get the final regular expression on the arc connecting the initial state to the accepting state.

The following video will detail this construction.

The next videos will detail a program in OCaml to transform an automaton into a regular expression using the Floyd-Warshall algorithm.

The following video presents the Floyd-Warshall transitive closure algorithm on a graph to present the concepts more simply before moving on to the next video to the automatic creation of a regular expression from an automaton. The code presented in the video can be found here.

The following video describes the code that can transform an automaton into a regular expression. The code can be found here, and the little python script that transforms our representation into the expected one on this site can be found here.

Questions

Let's consider the alphabet $\{a, b\}$. Find a deterministic automaton that describes the language: $\{ w \in \{a, b\}^*\ |\ w$ contains an even number of $a\}$. Then transform this automaton into a regular expression.

Let's consider the alphabet $\{a, b\}$. Find a deterministic automaton to describe the language: $\{ w \in \{a, b\}^*\ |\ w$ does not contain an odd number of $a$ or does not contain an even number of $b\}$. Then, transform this automaton into a regular expression. Note that this language is complementary to the language $\{ w \in \{a, b\}^*\ |\ w$ contains an odd number of $a$ and an even number of $b\}$ that we have already seen.

Regular Expressions Matching

We are going to put into practice the notions we have just seen about regular expressions and automata by making a small application to test whether or not a string matches a pattern represented by a regular expression.

We will present a sequence of interactions in the OCaml interpreter that we will be able to do with this application.

1utop # let re = RE.regex_from_string "0*(100*)*1?";;
2val re : RE.regex =
3  RE.Concatenation (RE.ZeroOrMore (RE.CharSet <abstr>),
4   RE.Concatenation
5    (RE.ZeroOrMore
6      (RE.Concatenation (RE.CharSet <abstr>,
7        RE.Concatenation (RE.CharSet <abstr>,
8         RE.ZeroOrMore (RE.CharSet <abstr>)))),
9    RE.ZeroOrOne (RE.CharSet <abstr>)))

On line 1, we create the regular expression $\color{darkgreen}{0^*(100^*)^*(1|\epsilon)}$ which allows us to describe the words that do not contain consecutive 1. Note that we use the notation 1? to represent $\color{darkgreen}{(1|\epsilon)}$.

We then create an equivalent non-deterministic finite automaton.

utop # let nfa = NFA.init re;;
val nfa : NFA.t = <abstr>

We can then test if a string, here 101010, belongs or not to the language generated by the non-deterministic finite automaton and thus by the regular expression.

utop # NFA.full_match nfa "101010";;
- : bool = true

We can see in the following example, that the string 011111100100, containing consecutive 1, does not match the regular expression.

utop # NFA.full_match nfa "011111100100";;
- : bool = false

We can test if a substring is in a given string by surrounding an expression with .*. The . represents any character. The following example will define a regular expression to search for the substring Doc.

utop # let re = RE.regex_from_string ".*Doc.*";;
val re : RE.regex =
  RE.Concatenation (RE.ZeroOrMore (RE.CharSet <abstr>),
   RE.Concatenation (RE.CharSet <abstr>,
    RE.Concatenation (RE.CharSet <abstr>,
     RE.Concatenation (RE.CharSet <abstr>, RE.ZeroOrMore (RE.CharSet <abstr>)))))

utop # let dfa = DFA.init re;;
val dfa : DFA.t = <abstr>

utop # DFA.full_match dfa "Wait a minute, Doc. Ah... Are you telling me that you built a time machine... out of a DeLorean?";;
- : bool = true

utop # DFA.full_match dfa "The way I see it, if you're gonna build a time machine into a car, why not do it with some style?";;
- : bool = false

The code that will be explained in the following videos can be found here.

In the following video, we will present an overview of the application and detail the code to go from a string representing a regular expression to an OCaml representation of this regular expression. The grammar describing regular expressions can be found here.

In the following video, we present the notion of programming with continuation that we will use in conjonction with backtracking in the pattern recognition module. The code to illustrate continuations can be found here.

There is the solution to one of the questions we ask in the video at the end of the listing on continuations.

In the following video, we will describe the pattern recognition module based on backtracking and continuations.

In the following video, we will describe the pattern recognition module based on non-deterministic finite automata.

In the following video, we will describe the pattern recognition module based on deterministic finite automata.

In the following video, we will show how we have tested our different modules.

Questions

The code below describes the part of the regex_from_string function that handles concatenations.

 1and re1 l =
 2  let e, l = re2 l in
 3  let e, l =
 4    let rec re1' e l =
 5      match l with
 6      | '?' :: r -> re1' (ZeroOrOne e) r
 7      | '*' :: r -> re1' (ZeroOrMore e) r
 8      | '+' :: r -> re1' (OneOrMore e) r
 9      | _ -> e, l
10    in
11    re1' e l
12  in
13  match l with
14  | c :: _ when c <> ')' && c <> '|' ->
15     let e', l = re1 l in
16     Concatenation (e, e'), l
17  | _ ->
18     e, l

During the video, we said that there were two cases to handle to see if there were no new concatenations. To do so, we test on line 17 if the next character is a vertical bar or the closing parenthesis but there is another case, which the code handles well, but which we have not talked about. What is this other case?

For the backtracking pattern matching module, we saw that for the regular expression $\color{green}{(a?)^{40}a^{40}}$ (the superscript 40 indicates that the string is repeated forty times) and the input string $a^{40}$ we got a prohibitive execution time. Can you find another regular expression and another input string that would also result in a very long execution time?

In our DFA module, we memoized the transitions already seen thanks to the Memo module whose definition is given below.

 1module Memo =
 2  Map.Make(
 3      struct
 4        type t = S.t * char
 5        let compare (s1, c1) (s2, c2) =
 6          let res = compare c1 c2 in
 7          if res = 0 then
 8            S.compare s1 s2
 9          else
10            res
11      end)

The key comparison function in this table, defined from line 5 to line 10, may require to compare sets on line 8.
When one is in a given state of the deterministic finite automaton, it is sufficient to look whether the transition on a particular character has already been seen. There is therefore no need to compare sets and to have for each transition from the same state, a key that contains a state and a character.
To avoid creating a table that requires a pair of state and character as a key, one should associate to each state of the deterministic finite automaton (which is a set of states of the non-deterministic finite automaton) a table. The keys of this table will be characters, and the table will allow to store the transitions already encountered.
The new module that we want to realize is the following.

module DFA2 : Matching = struct
  (* TO DO *)
end

Your mission, if you choose to accept it, is to code the module to implement our new idea.

Lexical Analyzer with Ocamllex

We will now describe the MiniJava lexical analyzer (or scanner) which is created using ocamllex. The tool ocamllex is a lexical analyzer generator. You give it a list of regular expressions with actions to perform when a regular expression is recognized. The tool will then automatically generate a lexical analyzer that looks roughly like the lexer.cpp program that have we studied above.

The following program shows a MiniJava program, Lexical.java, which is invalid, but is nevertheless lexically correct.

class
/*/*/
public 123MrC00der;
while )(
{ int
int42
[]
// this sentence is false

If we execute the command ./mini-java --show-tokens-with-loc Lexical.java to run our mini-java transpiler with the option to output only the tokens produced by the lexical analyzer, we obtain the following tokens⁴.

CLASS
PUBLIC
INT_CONST ‘123‘
IDENT ‘MrC00der‘ ▸ line 3, char 11 ◂
SEMICOLON
WHILE
RPAREN
LPAREN
LBRACE
INTEGER
IDENT ‘int42‘ ▸ line 6, char 1 ◂
LBRACKET
RBRACKET
EOF

The tokens will be used by the syntax analyzer that we will study in the next chapter.

The following video will introduce ocamllex and describe the lexical analyzer of our transpiler. The code of the calculator in Reverse Polish Notation can be found here.

Questions

Suppose that in the lexical analyzer, the longest matching rule (the regular expression that matches the maximum number of characters is selected) is not used, but instead, the shortest matching rule is used. Why cannot we correctly recognize the tokens of MiniJava?

We want to write a program, using ocamllex, that replaces tabulations by four spaces and removes spaces and tabulations just before the end of line. For example, suppose we have a file.txt file whose content is shown below using the unix cat command to display tabulations, represented by ^I, and line breaks represented by $.

cat -ET file.txt
    I wish you^I ^I$
 a very happy new^I^I year^I  $
          ^I$
$

If the ocamllex file is called clean.mll, it is compiled as shown below.

ocamllex clean.mll
ocamlopt clean.ml -o clean

We will then use the clean program on a file.txt file, for example, as follows.

./clean < fichier.txt > res.txt

We will then get in the res.txt file, the contents of the file.txt file where the tabulations have been transformed into four spaces, and where the spaces and tabulations right before the end of line have been removed.

cat -ET res.txt
    I wish you$
 a very happy new         year$
$
$

Implement the program that replaces tabulations with four spaces and removes spaces and tabulations just before the end of line.

Ressources

Play with regular expressions
Test regular expressions
Generate test-cases for regular expressions
Transform regular expressions into automata
Russ Cox on regular expression matching
Learn OCaml
Try OCaml
Functional programming course using OCaml
OCaml documentation
Ocamllex documentation
Ocamllex in Real World OCaml
ISO C++
C++ core guidelines
C++ standard

The empty word is the equivalent of the string "". ^[return]
A regular expression must be of finite size. ^[return]
ndf for non-deterministic finite. ^[return]
More precisely, we obtain a representation of the tokens. ^[return]