Language Development
Before we start implementing the language, we must discuss the elements that
usually compose a language engine.
This section focuses on a generic overview; the details of the Flavor
implementation are discussed in the dedicated implementation chapters.
A language's engine is usually divided into three sections:
- Lexical analysis
- Parsing
- Interpretation or Compilation
We will divide these phases into processing, which includes steps 1 and 2, and evaluation, which corresponds to step 3.
warning
These components, and the way they are described here, are just an introduction and are certainly not complete. There are many different ways one could go about the creation of a programming language.
From a high-level perspective, we can think of these steps as follows:
- Formalizing the content of the source code in a structure we can work with
- Checking if the source code follows the grammar rules we defined and, if so, producing a custom representation of the code to encode important information
- Executing the logic described in the source code in a virtual environment (interpretation) or translating the source code into lower level code such as assembly (compilation)
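To make the split between processing and evaluation concrete, here is a minimal sketch of the three stage signatures in TypeScript; all names (lex, parse, interpret, Token, AstNode, Value) are illustrative placeholders, not Flavor's actual API:
// Hypothetical stage signatures; Token, AstNode, and Value are placeholders.
type Token = [kind: string, text: string]; // identifier plus the source text that produced it
type AstNode = unknown;                    // defined by the parser later
type Value = number;                       // our examples only compute numbers

declare function lex(source: string): Token[];     // step 1: lexical analysis
declare function parse(tokens: Token[]): AstNode;  // step 2: parsing
declare function interpret(ast: AstNode): Value;   // step 3: evaluation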
We will go deeper into these topics as they become relevant to our work, but for now let's focus on the lexical analysis performed by the lexer.
There are several additional components commonly used when creating a language, such as the type checker and the error handler. These will be introduced throughout this and the following chapters as they become relevant.
The Lexer
As stated, the lexer is responsible for the formalization of the source code into a structure we can better work with programmatically. The usual choice for this structure is an array of tokens.
So what is a token?
We can define a token as a tuple containing an identifier and the source code
that generated it.
Intuitively, a token represents one of the pieces that compose the code.
A token should represent an element of the language independently of its
meaning.
The code 3 + 4, once analyzed by the lexer, will produce the following token
list:
token_list = [
(NUMBER, "3"),
(BIN_OP, "+"),
(NUMBER, "4"),
]
Naturally, we must have defined the available token names beforehand to use them as shown.
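For instance, the token names could be declared as an enum. The following TypeScript sketch is purely illustrative; the actual set of kinds is a design decision:
// A hypothetical set of token kinds for a small expression language.
enum TokenKind {
  NUMBER,     // numeric literals such as 3
  BIN_OP,     // binary operators such as +
  SEMICOLON,  // the symbol ;
  IDENTIFIER, // names such as foo
}

// A token is a tuple: the kind plus the source text that generated it.
type Token = [TokenKind, string];

const three: Token = [TokenKind.NUMBER, "3"];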
Here is where our design choices come into play for the first time: suppose we
want to define a language that does not use the symbol ^; that is a fair
choice, and it means that we will not define the token corresponding to that symbol.
This concept extends to more complex elements.
Let us expand this example so that we can refer back to it later in this
section.
To do so we will use the more complex expression 3 + 4 * 5.
let code = "3 + 4 * 5";
let tokens = Lexer.lexe(code);
/* Tokens
(NUMBER, "3"),
(BIN_OP, "+"),
(NUMBER, "4"),
(BIN_OP, "*"),
(NUMBER, "5"),
*/
How the lexer performs this processing is not that relevant here; there are
multiple ways to accomplish this result.
The most common is a single-character scanning system: the lexer scans the
source code one character at a time and checks whether it corresponds to a
token or to the start of one.
For example, if the character is the symbol ;, we produce the token SEMICOLON.
If the character is the letter f, it might be part of an identifier.
In this case, the lexer continues scanning the following characters (looking for alphanumeric ones) and,
once finished, produces the token IDENTIFIER.
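As an illustration, here is a minimal sketch of such a scanning loop in TypeScript, covering only the two cases just described plus whitespace; the structure and token names are assumptions for the example, not Flavor's real lexer:
type Token = [kind: string, text: string];

function lex(source: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < source.length) {
    const c = source[i];
    if (c === ";") {
      // Single-character token: emit it immediately.
      tokens.push(["SEMICOLON", c]);
      i++;
    } else if (/[A-Za-z_]/.test(c)) {
      // Start of an identifier: keep consuming alphanumeric characters.
      const start = i;
      while (i < source.length && /[A-Za-z0-9_]/.test(source[i])) i++;
      tokens.push(["IDENTIFIER", source.slice(start, i)]);
    } else if (/\s/.test(c)) {
      i++; // skip whitespace
    } else {
      throw new Error(`unexpected character '${c}'`);
    }
  }
  return tokens;
}

// lex("foo;") yields [["IDENTIFIER", "foo"], ["SEMICOLON", ";"]]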
The Parser
The parser is the next component we employ for the processing of the source code.
Let us get an intuitive understanding of the parser's purpose.
The goal is to construct what is known as an Abstract Syntax Tree (AST).
An AST is a tree-like structure that encodes the tokens of the token list
(generated by the lexer) with added information (for example, the priority of
operations).
Keeping the example from before (3 + 4 * 5), we expect the corresponding
AST to be as follows:
(+)
/ \
/ \
[3] (*)
/ \
[4] [5]
What is the meaning of this structure?
Imagine exploring this tree from the root downwards.
First we encounter the addition node +; we know that the two operands of
the addition are the children of the + node.
We then go to the left child and find the number 3; 3 is a terminal element,
so we are done and we know the first of the two operands.
Moving to the right child, we find the multiplication operator *.
As with the addition, we look for the operands of the multiplication in the
children nodes.
We find two terminal nodes, meaning that we can resolve the * operation
(4 * 5 = 20).
We now have the second operand of the addition (20); summing the two operands
we obtain 23, correctly interpreting the tree.
We have done two things: first, congratulations, we have successfully simulated
an interpreter; second, we have shown the importance of the precedence of
operations that a tree encodes (we respect the arithmetical rule of resolving
multiplications before additions).
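The walk we just performed by hand maps directly onto a short recursive function. Here is a sketch in TypeScript, assuming a minimal, illustrative AST shape (not Flavor's actual one):
// Minimal AST shape for binary arithmetic expressions (illustrative).
type Expr =
  | { kind: "number"; value: number }
  | { kind: "binop"; op: "+" | "*"; left: Expr; right: Expr };

function evaluate(node: Expr): number {
  if (node.kind === "number") return node.value; // terminal node
  const left = evaluate(node.left);   // resolve both operands first...
  const right = evaluate(node.right);
  return node.op === "+" ? left + right : left * right; // ...then apply the operator
}

// The tree for 3 + 4 * 5: the deeper node (*) is evaluated first.
const ast: Expr = {
  kind: "binop", op: "+",
  left: { kind: "number", value: 3 },
  right: {
    kind: "binop", op: "*",
    left: { kind: "number", value: 4 },
    right: { kind: "number", value: 5 },
  },
};
console.log(evaluate(ast)); // 23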
Let us now understand how to build the parser so that it produces the AST from the token list. We need to give meaning to the tokens, and to do so we need a grammar. In this context, a grammar is a set of rules that define how tokens can be combined to form valid expressions in the language. This step is important, so we will restate it.
We use the grammar to specify how the tokens must appear in the token list.
The simple grammar expr := NUMBER '+' NUMBER will accept the tokens [(NUMBER, "3"), (BIN_OP, "+"), (NUMBER, "4")], thus accepting the code 3 + 4.
Anything other than these specific elements will be rejected by the grammar.
The parser then uses the grammar to check that the tokens appear in a valid order.
If they do, the corresponding AST node is
produced (in this case the addition node (+) with the two number children [3] and [4]).
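To connect this with the 3 + 4 * 5 example, here is a hedged sketch of a recursive-descent parser for the slightly richer grammar expr := term ('+' term)* and term := NUMBER ('*' NUMBER)*; the function names and AST shape are illustrative choices, not Flavor's actual implementation:
type Token = [kind: string, text: string];
type Expr =
  | { kind: "number"; value: number }
  | { kind: "binop"; op: "+" | "*"; left: Expr; right: Expr };

// expr := term ('+' term)*     -- '+' binds loosest, so it sits near the root
// term := NUMBER ('*' NUMBER)* -- '*' binds tightest, so it ends up deeper
function parse(tokens: Token[]): Expr {
  let pos = 0;

  function number(): Expr {
    const [kind, text] = tokens[pos];
    if (kind !== "NUMBER") throw new Error(`expected NUMBER, got ${kind}`);
    pos++;
    return { kind: "number", value: Number(text) };
  }

  function term(): Expr {
    let left = number();
    while (pos < tokens.length && tokens[pos][1] === "*") {
      pos++; // consume '*'
      left = { kind: "binop", op: "*", left, right: number() };
    }
    return left;
  }

  function expr(): Expr {
    let left = term();
    while (pos < tokens.length && tokens[pos][1] === "+") {
      pos++; // consume '+'
      left = { kind: "binop", op: "+", left, right: term() };
    }
    return left;
  }

  return expr();
}

// parse([["NUMBER","3"],["BIN_OP","+"],["NUMBER","4"],["BIN_OP","*"],["NUMBER","5"]])
// yields (+) with children [3] and (*), matching the tree shown above.
Note how the grammar's layering is what encodes precedence: * is matched inside term, so it always ends up deeper in the tree than +.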
Type Checking
Let's discuss the importance of the type checker.
When designing a programming language, we can choose how strict its typing rules
should be. However, every language (even those without explicit types) has some
form of typing system to ensure operations are evaluated correctly and
predictably.
To illustrate this, consider adding an integer and a decimal number, like 3 + 4.2.
In a strictly typed language, where integers and floats are treated as distinct
types, this operation might be disallowed.
On the other hand, in a language without explicit typing, the operation might be
allowed and produce a result like 7.2.
These different approaches are called policies, and language designers have
the freedom to define them as they see fit.
For example, in JavaScript, the expression "Hello " + 3 is valid and results
in the string "Hello 3".
The type checker ensures that the program follows specific rules to guarantee predictable behavior during runtime and to provide detailed error messages during compilation.
Let's explore how a type checker might work. We can design it to analyze the Abstract Syntax Tree (AST) by examining its contents and identifying any inconsistencies. Since the AST is a tree structure, we can use existing tree traversal algorithms to navigate it.
To keep things simple, imagine the type checker as a recursive function that uses some persistent state to perform the necessary checks and enforce the rules.
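As a sketch of that idea, here is a recursive checker over the toy expression AST from earlier, extended with an int/float distinction to mirror the 3 + 4.2 example; the strict policy shown (rejecting mixed operands) is just one possible choice, and a real checker would also thread state such as variable types:
type Type = "int" | "float";

type Expr =
  | { kind: "int"; value: number }
  | { kind: "float"; value: number }
  | { kind: "binop"; op: "+" | "*"; left: Expr; right: Expr };

// Recursively compute the type of each node, failing on policy violations.
function typecheck(node: Expr): Type {
  if (node.kind === "int") return "int";
  if (node.kind === "float") return "float";
  const left = typecheck(node.left);
  const right = typecheck(node.right);
  // Strict policy: operands of a binary operation must share a type.
  if (left !== right) {
    throw new Error(`type error: cannot apply '${node.op}' to ${left} and ${right}`);
  }
  return left;
}

// typecheck on 3 + 4.2 throws; a laxer policy could instead return "float".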
The type checker works independently of the language's typing system and, in my opinion, enables the greatest flexibility during the language design phase.