Implementation

To start actually implementing the language, we must first discuss the elements that usually compose a language engine.
This section focuses on a generic overview; the details of the Flavor implementation are discussed in the dedicated chapters (e.g. Lexer).

A language's engine is usually divided into three sections:

  1. Lexical analysis
  2. Parsing
  3. Interpretation or Compilation

warning

These components, and the way they are described here, are only an introduction and certainly not a complete treatment. There are many different ways one could go about creating a programming language.

From a high-level perspective, we can think of these steps as follows:

  1. Formalizing the content of the source code in a structure we can work with
  2. Checking if the source code follows the grammar rules we defined and, if so, producing a custom representation of the code to encode important information
  3. Executing the logic described in the source code in a virtual environment (interpretation) or translating the source code into lower-level code, such as assembly (compilation)

We will go deeper into these topics as they become relevant in our work, but for now let's focus on the lexical analysis performed by the lexer.

The Lexer

As stated, the lexer is responsible for the formalization of the source code into a structure we can better work with programmatically. The usual choice for this structure is an array of tokens.

So what is a token?
We can define a token as a tuple containing an identifier and the source code that generated it. The code 3 + 4, once analyzed by the lexer, will produce the following token list:

tokens
token_list = [
    (NUMBER, "3"),
    (BIN_OP, "+"),
    (NUMBER, "4"),
]

Naturally, we must define the tokens before we can use them as shown. This is where our design choices come into play for the first time: suppose we want to define a language that does not use the symbol ^; that is a fair choice, and it simply means that we will not define the token corresponding to that symbol.
This concept extends to more complex elements.
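
To make this concrete, here is a minimal sketch, in Rust, of how such token definitions could look. All names here (TokenKind, Token) are illustrative assumptions, not Flavor's actual types, which are covered in the dedicated chapters.

Token definition sketch
// Illustrative sketch: these are NOT Flavor's actual types.
#[derive(Debug, Clone)]
enum TokenKind {
    Number,
    BinOp,
}

// A token pairs an identifier (its kind) with the fragment of
// source code that generated it.
#[derive(Debug, Clone)]
struct Token {
    kind: TokenKind,
    text: String,
}

fn main() {
    // The token list produced for the code `3 + 4`.
    let token_list = vec![
        Token { kind: TokenKind::Number, text: "3".into() },
        Token { kind: TokenKind::BinOp, text: "+".into() },
        Token { kind: TokenKind::Number, text: "4".into() },
    ];
    println!("{:?}", token_list);
}

Note how leaving ^ out of the token kinds is all it takes: the lexer then has no way of producing a token for that symbol.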

Let us elaborate on this example so that we can refer back to it later in this section. To do so, we will use the slightly more complex expression 3 + 4 * 5.

Lexing example
let code = "3 + 4 * 5";

let tokens = Lexer.lexe(code);

/* Tokens
(NUMBER, "3"),
(BIN_OP, "+"),
(NUMBER, "4"),
(BIN_OP, "*"),
(NUMBER, "5"),
*/

How the lexer performs this processing is not that relevant here; there are multiple ways to accomplish this result.
The most common is a single-character scanning system. The logic is that the lexer scans the source code one character at a time and checks whether it corresponds to a token or to the start of one. For example, if the character is the symbol ;, we produce the token SEMICOLON. If the character is the letter f, it might be part of an identifier; in this case, the lexer continues scanning the following characters (looking for alphanumeric ones) and, once finished, produces the token IDENTIFIER.
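
As a rough sketch of this scanning logic, here is a minimal single-character lexer in Rust. The token names and the lex function are assumptions for illustration; the real Flavor lexer is described in its dedicated chapter.

Scanning sketch
#[derive(Debug)]
enum Token {
    Number(String),
    BinOp(char),
    Semicolon,
    Identifier(String),
}

fn lex(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = source.chars().peekable();

    while let Some(&c) = chars.peek() {
        match c {
            // Single-character tokens are produced immediately.
            ';' => { chars.next(); tokens.push(Token::Semicolon); }
            '+' | '-' | '*' | '/' => { chars.next(); tokens.push(Token::BinOp(c)); }
            // A digit starts a number: keep consuming digits.
            '0'..='9' => {
                let mut num = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() { num.push(d); chars.next(); } else { break; }
                }
                tokens.push(Token::Number(num));
            }
            // A letter starts an identifier: keep consuming
            // alphanumeric characters.
            c if c.is_alphabetic() => {
                let mut ident = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_alphanumeric() { ident.push(d); chars.next(); } else { break; }
                }
                tokens.push(Token::Identifier(ident));
            }
            // Skip whitespace and anything we do not recognize.
            _ => { chars.next(); }
        }
    }
    tokens
}

fn main() {
    // [Number("3"), BinOp('+'), Number("4"), BinOp('*'), Number("5")]
    println!("{:?}", lex("3 + 4 * 5"));
}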

The Parser

The parser is the next component we employ for the processing of the source code.

Let us get an intuitive understanding of the parser's purpose. The goal is to construct what is known as an Abstract Syntax Tree (AST). An AST is a tree-like structure which encodes the tokens of the token list (generated by the lexer) with added information (for example, priority in the operations).
Keeping the example from before (3 + 4 * 5), we expect the corresponding AST to be as follows:

AST
    (+)
    / \
   /   \
 [3]   (*)
       / \
     [4] [5]

What is the meaning of this structure?
Imagine exploring this tree from the root downwards.
First we encounter the addition node (+); we know that the two operands of the addition are the children of this node. We go to the left child and find the number 3; 3 is a terminal element, so we are done and we know the first of the two operands. Moving then to the right child, we find the operator * for the multiplication. As already shown for the addition, we look for the operands of the multiplication in the child nodes. We find two terminal nodes, meaning that we know how to resolve the * operation (4 * 5 = 20). We now have the second operand for the addition (20); summing the two operands we obtain 23, correctly interpreting the tree.
We have done two things: first, congratulations, we have successfully simulated an interpreter; second, we have shown the importance of the operation precedence that a tree encodes (we respect the arithmetical rule of resolving multiplications before additions).
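
As a sketch of what we just simulated by hand, here is one possible representation of this AST in Rust, together with the recursive walk that evaluates it. The Expr type and the eval function are hypothetical illustrations, not Flavor's actual interpreter.

AST evaluation sketch
enum Expr {
    Number(f64),
    BinOp { op: char, left: Box<Expr>, right: Box<Expr> },
}

// Walk the tree from the root downwards: resolve both children,
// then apply the operator to the results.
fn eval(expr: &Expr) -> f64 {
    match expr {
        Expr::Number(n) => *n,
        Expr::BinOp { op, left, right } => {
            let (l, r) = (eval(left), eval(right));
            match *op {
                '+' => l + r,
                '*' => l * r,
                other => panic!("unsupported operator: {other}"),
            }
        }
    }
}

fn main() {
    // The AST for `3 + 4 * 5`: the multiplication sits deeper in
    // the tree, so it is resolved before the addition.
    let ast = Expr::BinOp {
        op: '+',
        left: Box::new(Expr::Number(3.0)),
        right: Box::new(Expr::BinOp {
            op: '*',
            left: Box::new(Expr::Number(4.0)),
            right: Box::new(Expr::Number(5.0)),
        }),
    };
    println!("{}", eval(&ast)); // prints 23
}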

Let us now understand how to build the parser so that it produces the AST from the token list. We need to give meaning to the tokens, and to do so we need a grammar. What the grammar does in this context is provide the rules used to check the correctness of the token ordering.
This step is important, so we will reiterate it.

We use the grammar to specify how the tokens must appear in the token list. The simple grammar expr := NUMBER '+' NUMBER will accept the tokens [(NUMBER, "3"), (BIN_OP, "+"), (NUMBER, "4")], in turn accepting the code 3+4. Anything other than this specific sequence will be rejected by the grammar.
The parser then uses the grammar to check the correct order of the tokens. If the tokens are in a valid sequence, the corresponding AST node is produced (in this case the addition node (+) with the two number children [3] and [4]).
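
To illustrate, here is a minimal sketch, in Rust, of a parser enforcing exactly this expr := NUMBER '+' NUMBER rule. The types and the parse_expr function are assumptions made for this example, not Flavor's actual parser.

Parsing sketch
#[derive(Debug)]
enum Token {
    Number(String),
    BinOp(String),
}

#[derive(Debug)]
enum Expr {
    Number(f64),
    Add(Box<Expr>, Box<Expr>),
}

// Check the token sequence against `expr := NUMBER '+' NUMBER`:
// a valid sequence yields the addition node, anything else is
// rejected with an error.
fn parse_expr(tokens: &[Token]) -> Result<Expr, String> {
    match tokens {
        [Token::Number(a), Token::BinOp(op), Token::Number(b)] if op.as_str() == "+" => {
            let left = Expr::Number(a.parse().map_err(|e| format!("{e}"))?);
            let right = Expr::Number(b.parse().map_err(|e| format!("{e}"))?);
            Ok(Expr::Add(Box::new(left), Box::new(right)))
        }
        _ => Err("tokens rejected: expected NUMBER '+' NUMBER".to_string()),
    }
}

fn main() {
    let tokens = vec![
        Token::Number("3".into()),
        Token::BinOp("+".into()),
        Token::Number("4".into()),
    ];
    // Prints: Ok(Add(Number(3.0), Number(4.0)))
    println!("{:?}", parse_expr(&tokens));
}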