Language Development

To start actually implementing the language, we must first discuss the elements that usually compose a language engine.
This section gives a generic overview; the details of the Flavor implementation are discussed in the dedicated chapters (i.e. implementation).

A language's engine is usually divided into three sections:

  1. Lexical analysis
  2. Parsing
  3. Interpretation or Compilation

We will divide these phases into processing, which includes steps 1 and 2, and evaluation, which corresponds to step 3.

warning

These components, and the way they are described here, are only an introduction and are certainly not exhaustive. There are many different ways one could go about creating a programming language.

From a high-level perspective, we can think of these steps as follows:

  1. Formalizing the content of the source code in a structure we can work with
  2. Checking if the source code follows the grammar rules we defined and, if so, producing a custom representation of the code to encode important information
  3. Executing the logic described in the source code in a virtual environment (interpretation) or translating the source code into lower level code such as assembly (compilation)

We will go deeper into these topics as they become relevant in our work, but for now let's focus on the lexical analysis performed by the lexer.

There are several additional components commonly used when creating a language. Some examples include the type checker and the error handler. These components will be introduced throughout this and the following chapters, as they become relevant.

The Lexer

As stated, the lexer is responsible for the formalization of the source code into a structure we can better work with programmatically. The usual choice for this structure is an array of tokens.

So what is a token?
We can define a token as a tuple containing an identifier and the source code that generated it. Intuitively, tokens represent the pieces that compose the code. A token should represent an element of the language independently of its meaning. The code 3 + 4, once analyzed by the lexer, will produce the following token list:

tokens
token_list = [
    (NUMBER, "3"),
    (BIN_OP, "+"),
    (NUMBER, "4"),
]

Naturally, we must have defined the available token names beforehand to use them as shown. This is where our design choices come into play for the first time: suppose we want to define a language that does not use the symbol ^; that is a fair choice, and it simply means we will not define the token corresponding to that symbol.
The same concept extends to more complex elements.
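
To make this concrete, here is a minimal sketch, in Rust, of what such token definitions might look like. The TokenKind variants mirror the names used above and are purely illustrative, not Flavor's actual definitions.

Token definitions (sketch)
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    Number,     // e.g. "3", "42"
    BinOp,      // e.g. "+", "*"
    Semicolon,  // ";"
    Identifier, // e.g. "foo"
    // No variant for '^': a symbol we choose not to support
    // simply never gets a token.
}

// A token pairs a kind with the piece of source code that generated it.
#[derive(Debug, Clone)]
struct Token {
    kind: TokenKind,
    text: String,
}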

Let us elaborate on this example further so that we can refer back to it later in this section. To do so, we will use the slightly more complex expression 3 + 4 * 5.

Lexing example
let code = "3 + 4 * 5";

let tokens = Lexer.lex(code);

/* Tokens
(NUMBER, "3"),
(BIN_OP, "+"),
(NUMBER, "4"),
(BIN_OP, "*"),
(NUMBER, "5"),
*/

Exactly how the lexer performs this processing is not that relevant; there are multiple ways to accomplish the same result.
The most common is a single-character scanning system. The lexer scans the source code one character at a time and checks whether it corresponds to a token or to the start of one. For example, if the character is the symbol ;, the lexer produces the token SEMICOLON. If the character is the letter f, it might be part of an identifier; in this case, the lexer continues scanning the following characters (looking for alphanumeric ones) and, once finished, produces the token IDENTIFIER.
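
As an illustration, here is a minimal single-character scanner in Rust, building on the Token and TokenKind types sketched above. It is a simplification that only understands digits, identifiers, the operators + and *, and the semicolon.

Lexer sketch
fn lex(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = source.chars().peekable();

    while let Some(&c) = chars.peek() {
        match c {
            // Whitespace separates tokens but produces none.
            c if c.is_whitespace() => { chars.next(); }
            // Single-character tokens are emitted immediately.
            ';' => { chars.next(); tokens.push(Token { kind: TokenKind::Semicolon, text: ";".into() }); }
            '+' | '*' => { chars.next(); tokens.push(Token { kind: TokenKind::BinOp, text: c.to_string() }); }
            // A digit starts a NUMBER: keep consuming digits.
            c if c.is_ascii_digit() => {
                let mut text = String::new();
                while let Some(&d) = chars.peek() {
                    if !d.is_ascii_digit() { break; }
                    text.push(d);
                    chars.next();
                }
                tokens.push(Token { kind: TokenKind::Number, text });
            }
            // A letter starts an IDENTIFIER: keep consuming alphanumerics.
            c if c.is_alphabetic() => {
                let mut text = String::new();
                while let Some(&a) = chars.peek() {
                    if !a.is_alphanumeric() { break; }
                    text.push(a);
                    chars.next();
                }
                tokens.push(Token { kind: TokenKind::Identifier, text });
            }
            // Anything else is not part of the language; a real lexer
            // would report an error here instead of skipping silently.
            _ => { chars.next(); }
        }
    }
    tokens
}

Calling lex("3 + 4 * 5") produces exactly the five tokens listed in the comment above.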

The Parser

The parser is the next component we employ for the processing of the source code.

Let us get an intuitive understanding of the parser's purpose. The goal is to construct what is known as an Abstract Syntax Tree (AST). An AST is a tree-like structure which encodes the tokens of the token list (generated by the lexer) with added information (for example, the priority of the operations).
Keeping the example from before (3 + 4 * 5), we expect the corresponding AST to be as follows:

AST
    (+)
    / \
   /   \
 [3]   (*)
       / \
     [4] [5]

What is the meaning of this structure?
Imagine exploring this tree from the root downwards.
First we encounter the addition node (+), and we know that the two operands of the addition are its child nodes. We go to the left child and find the number 3; 3 is a terminal element, so we are done and we know the first of the two operands. Moving to the right child, we find the operator (*) for the multiplication. As already shown for the addition, we look for the operands of the multiplication in its child nodes. We find two terminal nodes, meaning that we know how to resolve the (*) operation (4 * 5 = 20). We now have the second operand of the addition (20); summing the two operands we obtain 23, correctly interpreting the tree.
We have done two things: first, congratulations, we have successfully simulated an interpreter; second, we have shown the importance of the operator precedence that a tree encodes (we respect the arithmetical rule of resolving multiplications before additions).
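
To make the walkthrough concrete, here is a sketch of the same tree and its recursive evaluation in Rust. The Expr type and eval function are illustrative, not the final Flavor implementation.

AST evaluation sketch
// A hypothetical AST type for arithmetic expressions.
enum Expr {
    Number(f64),
    BinOp { op: char, left: Box<Expr>, right: Box<Expr> },
}

// The recursive walk described above: terminal nodes yield their value,
// operator nodes resolve both children first, then apply the operator.
fn eval(expr: &Expr) -> f64 {
    match expr {
        Expr::Number(n) => *n,
        Expr::BinOp { op, left, right } => {
            let (l, r) = (eval(left), eval(right));
            match op {
                '+' => l + r,
                '*' => l * r,
                _ => panic!("unknown operator"),
            }
        }
    }
}

fn main() {
    // The tree for 3 + 4 * 5: the (*) node is a child of (+),
    // so the multiplication is resolved first, giving 3 + 20 = 23.
    let ast = Expr::BinOp {
        op: '+',
        left: Box::new(Expr::Number(3.0)),
        right: Box::new(Expr::BinOp {
            op: '*',
            left: Box::new(Expr::Number(4.0)),
            right: Box::new(Expr::Number(5.0)),
        }),
    };
    assert_eq!(eval(&ast), 23.0);
}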

Let us now understand how to build the parser so that it produces the AST from the token list. We need to give meaning to the tokens, and to do so we need a grammar. In this context, a grammar is a set of rules that define how the tokens can be combined to form valid expressions in the language. This step is important, so we will restate it.

We use the grammar to specify how the tokens must appear in the token list. The simple grammar expr := NUMBER '+' NUMBER will accept the tokens [(NUMBER, "3"), (BIN_OP, "+"), (NUMBER, "4")], in turn accepting the code 3+4. Anything other than these specific elements will be rejected by the grammar.
The parser then uses the grammar to check that the tokens appear in a valid order. If they do, the corresponding AST node is produced (in this case the addition node (+) with the two number children [3] and [4]).
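
Here is a sketch of how a recursive-descent parser might encode such a grammar, reusing the Token, TokenKind, and Expr types from the earlier sketches. The grammar itself is an assumption made for this example: expr := term ('+' term)*, term := factor ('*' factor)*, factor := NUMBER. Placing '*' one rule deeper than '+' is what produces the precedence we saw in the AST.

Parser sketch
struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    // expr := term ('+' term)*
    fn parse_expr(&mut self) -> Expr {
        let mut node = self.parse_term();
        while self.peek_op('+') {
            self.pos += 1; // consume '+'
            let right = self.parse_term();
            node = Expr::BinOp { op: '+', left: Box::new(node), right: Box::new(right) };
        }
        node
    }

    // term := factor ('*' factor)*
    fn parse_term(&mut self) -> Expr {
        let mut node = self.parse_factor();
        while self.peek_op('*') {
            self.pos += 1; // consume '*'
            let right = self.parse_factor();
            node = Expr::BinOp { op: '*', left: Box::new(node), right: Box::new(right) };
        }
        node
    }

    // factor := NUMBER
    fn parse_factor(&mut self) -> Expr {
        let token = &self.tokens[self.pos];
        match token.kind {
            TokenKind::Number => {
                let value = token.text.parse().expect("NUMBER tokens hold digits");
                self.pos += 1;
                Expr::Number(value)
            }
            // Any other token here violates the grammar.
            _ => panic!("expected NUMBER, found {:?}", token),
        }
    }

    fn peek_op(&self, op: char) -> bool {
        self.tokens.get(self.pos).map_or(false, |t| {
            t.kind == TokenKind::BinOp && t.text == op.to_string()
        })
    }
}

Running Parser { tokens: lex("3 + 4 * 5"), pos: 0 }.parse_expr() yields exactly the tree drawn above.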

Type Checking

Let's discuss the importance of the type checker.

When designing a programming language, we can choose how strict its typing rules should be. However, every language (even those without explicit types) has some form of typing system to ensure operations are evaluated correctly and predictably. To illustrate this, consider adding an integer and a decimal number, like 3 + 4.2. In a strictly typed language, where integers and floats are treated as distinct types, this operation might be disallowed. On the other hand, in a language without explicit typing, the operation might be allowed and produce a result like 7.2. These different approaches are called policies, and language designers have the freedom to define them as they see fit. For example, in JavaScript, the expression "Hello " + 3 is valid and results in the string "Hello 3".

The type checker ensures that the program follows specific rules to guarantee predictable behavior during runtime and to provide detailed error messages during compilation.

Let's explore how a type checker might work. We can design it to analyze the Abstract Syntax Tree (AST) by examining its contents and identifying any inconsistencies. Since the AST is a tree structure, we can use existing tree traversal algorithms to navigate it.

To keep things simple, imagine the type checker as a recursive function that uses some persistent state to perform the necessary checks and enforce the rules.
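
As a sketch, here is what such a recursive function might look like in Rust, under an assumed strict policy that rejects mixed-type operations. The Type and TypedExpr names are hypothetical, and the AST is extended with separate integer and float literals so that there is actually something to check.

Type checker sketch
#[derive(Debug, Clone, Copy, PartialEq)]
enum Type {
    Int,
    Float,
}

enum TypedExpr {
    Int(i64),
    Float(f64),
    BinOp { op: char, left: Box<TypedExpr>, right: Box<TypedExpr> },
}

// Recursively compute the type of each node; literals are the base case,
// operator nodes check their children first, then enforce the policy.
fn check(expr: &TypedExpr) -> Result<Type, String> {
    match expr {
        TypedExpr::Int(_) => Ok(Type::Int),
        TypedExpr::Float(_) => Ok(Type::Float),
        TypedExpr::BinOp { op, left, right } => {
            let (l, r) = (check(left)?, check(right)?);
            if l == r {
                Ok(l)
            } else {
                // Strict policy: mixed operands are an error.
                Err(format!("type mismatch in '{op}': {l:?} vs {r:?}"))
            }
        }
    }
}

With this policy, the tree for 3 + 4.2 produces an Err. A more permissive policy could instead return Ok(Type::Float) in the mismatch branch, coercing the integer; that choice is exactly the kind of design freedom discussed above.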

The type checker works independently of the language's typing system, and, in my opinion, this independence enables the greatest flexibility during the language design phase.