Implementation
To start actually implementing the language we must discuss a bit about the
elements that usually compose an engine.
This section will focus on a generic overview; the detail of the Flavor
implementation are discussed in the dedicated chapters (i.e. Lexer).
A language's engine is usually divided into three sections:
- Lexical analisys
- Parsing
- Interpretation or Compilation
warning
These components and the way they are described here are just an introduction, surely not complete. There are multiple different ways one could go about the creation of a programming language.
From a high level prospective, we can think of these steps as follows:
- Formalizing the content of the source code in a structure we can work with
- Checking if the source code follows the grammar rules we defined and, if so, producing a custom representation of the code to encode important information
- Executing the logic described in the source code in a virtual environment (interpretation) or translating the source code into lower level code such as assembly (compilation)
We will go deeper into these topics as they become relevant in our work, but for now let's focus on the lexical analisys performed by the lexer.
The Lexer
As stated, the lexer is responsible for the formalization of the source code into a structure we can better work with programmatically. The usual choice for this structure is an array of tokens.
So what is a token?
We can define a token as a tuple containing an identifier and the source code
that generated it.
The code 3 + 4
, once analyzed by the lexer, will produce the following token
list:
token_list = [
(NUMBER, "3"),
(BIN_OP, "+"),
(NUMBER, "4"),
]
Naturally, we must have defined the tokens previously to use them as shown.
Here is where our design choices come into play for the first time: suppose we
want to define a language that does not use the symbol ^
; that is fair and it
means that we will not define the token corresponding to that symbol.
This concept is expandible to more complex elements.
Let us elaborate on this example further so that we can reference it in a more
complete manner later in this section.
To do so we will use the more complex expression 3 + 4 * 5
.
let code = "3 + 4 * 5";
let tokens = Lexer.lexe(code);
/* Tokens
(NUMBER, "3"),
(BIN_OP, "+"),
(NUMBER, "4"),
(BIN_OP, "*"),
(NUMBER, "5"),
*/
How the lexer does this processing to the source code is not that relevant.
There are multiple ways in which we can acomplish this result.
The most common is the usage of single character scanning system.
The logic is that the lexer scans the source code one character at the time and
checks it to see if it correspond to a token or to the start of a token.
For example, if the character is the symbol ;
, we produce the token SEMICOLON
If the character is the letter f
, it might be part of an identifier.
In this case, the lexer continues scanning the following characters (looking for alphanumeric ones) and,
once finished, produces the token IDENTIFIER
.
The Parser
The parser is the next component we employ for the processing of the source code.
Let us get an intuitive understanding of the parser's purpose.
The goal is to construct what is known as an Abstract Syntax Tree (AST).
An AST is a tree-like structure which encodes the tokens of the token list
(generated by the lexer) with added information (for example, priority in the
operations).
Keeping the example from before (3 + 4 * 5
) we will expect the corresponding
AST to be as follows:
(+)
/ \
/ \
[3] (*)
/ \
[4] [5]
What is the meaning of this structure?
Imagine exploring this tree from the root downwards.
First we encounter the addition node +
, we know that the two operands of
the addition are the children node of the +
.
We then go to the left child finding the number 3
; 3
is a terminal element
so we are done and we know the first of the two operands.
Going then to the left element, we find the operator *
for the multiplication.
As already shown for the addition, we go look for the operands of the
multiplication in the children nodes.
We find two terminal nodes meaning that we know how to resolve the *
operation
(4 * 5 = 20
).
We now have the second operand for the addition (20
), summing the two operands
we obtain 23
, correctly interpreting the tree.
We have done two things: first, congratulations, we have successfully simulated
an interpreter; second, we have shown the importance of the precedence of
operations that a tree encodes (we respect the arithmetical rules of resolving
multiplications before additions).
Let us now understand how to build the parser so that it will produce the AST
from the token list.
We need to give meaning to the tokens, to do so we need a grammar.
What the grammar does in this context is to provide rules to use to check the
correctness of the tokens ordering.
This step is important so we will reiterate on it.
We use the grammar to specify how the tokens must appear in the token list.
The simple grammar expr := NUMBER '+' NUMBER
will accept the tokens [(NUMBER, "3"), (BIN_OP, "+"), (NUMBER, "4")]
in turn accepting the code 3+4
.
Everything other than this specific elements will be regected by the grammar.
The parser then uses the grammar to check the correct order of the tokens.
If the tokens are in a valid sequence, then the corresponding AST node is
produced (in this case the addition node (+)
with the two number children [3]
[4]
).