
Introduction

warning

Book rework in progress

In this book, we will follow my attempt at building a simple community-driven programming language (GitHub Repository). It is not my first time attempting such a project, but it is the first time doing it in a structured way while sharing the process and the code. The other attempts were more of a learning experience; one produced a lexer and parser, another a simple interpreter for a university language course. Each time I was missing the whole picture, so this time I will try to cover all the steps. Moreover, I have never built a compiler, so this will be a new experience for me.

To specify some things before we start:

  • This project is not meant to be a learning resource, but rather a learning platform; the idea of sharing this experience is guided by the hope that it will help others learn from my mistakes and successes. I also hope to learn from suggestions and critiques along the way myself.
  • This book will primarily show my reasoning process and the decisions I make about language design and implementation. I will try to keep the explanations linear and clear (don't make me promise).
  • I will try to keep the code as clean and readable as possible, but I cannot guarantee that it will be perfect.
  • I will be using Rust for reasons that we will discuss.
  • The implementation will be done from scratch -- as much as possible -- limiting the use of libraries to the bare minimum.
  • Calling it a book sounds pretentious; I will use this terminology purely because I'm using mdbook to write it and book appears to be the name used.

Lastly, I appreciate any feedback, suggestions, or contributions to the project. I will try to keep the code on GitHub up to date with the content of the book, so you can follow along with the code as we progress through the chapters.

Without further ado, let's get started!

Who am I?

I am a computer science student with a passion for programming languages who cannot focus on one thing for too long. I have been learning about programming language creation for a while (mainly for my degree), and I wanted to put my knowledge into practice.

Please note that English is not my first language, so there might be some mistakes in the text. I might confuse the usage of 'I' and 'we' to describe the subject behind the process, and I am sure that the sequence of tenses will not be perfect. I hope that at some point the usage of the word 'I' will be wrong due to the contributions of others, but for now, I will use it to refer to myself (at least in the Introduction).

I find criticism very helpful and I like environments where I can learn from others -- and hopefully where others can learn a thing or two from me -- so I will be happy to receive any feedback or suggestions.

Who are you?

If you found this book, there is a good chance that you are interested in language development and design. I do not want to be so presumptuous as to assume that you will learn from this resource even if you are a seasoned developer, but I hope that you will find something useful in it.

The target audience I have in mind is someone who has some programming experience, but is not necessarily an expert and, more importantly, who is quite new to the topic of programming languages.

How to read this book

The book is structured as a sort of diary journaling the development of Flavor.
For that reason the chapters will not follow a logical structure, but will instead follow the production timeline.

To keep some structure in the book, it will be organized as follows:

  • Language Development
    • Describes language development concepts and ideas.
  • Design
    • Outlines the design choices behind the Flavor language, discussing their consequences and implications.
    • Flavor features
  • Implementation
    • Describes the implementation structure, digging deep into the code.

Look out for the call-outs! I will insert call-outs throughout the book, and each type has its own meaning.

note

Here I will include notes, mainly my reasoning and thoughts.

question

Here I will note things to be studied further or that I do not fully understand yet.

important

Here I will describe important sections to look out for (usually highlighting possible mistakes).

Add more of those here...

Syntax highlighting! To highlight the syntax of Flavor throughout the examples and snippets in this book, we will use other languages' highlighting (at least until I figure out how to create custom highlighting for Flavor), so it might be imperfect.

Some design choices

The first, and most predominant, choice is the language we will be using for development. I have chosen Rust for a few reasons. First, I am quite enjoying the language: I find its consistency and safety features very appealing, and the strong typing system promotes good design at an early stage. There could be a counterargument to this, mainly that Rust puts so much focus on types that, after spending a lot of time resolving errors arising from type inconsistencies, you will have forgotten what the program was supposed to do. But it surely runs without any errors. I do not think this is a problem, and even so, there are multiple languages one could use for this project, so feel free to experiment. Second, Rust offers the possibility to write low-level code that, in the remote case that I manage to finish this project by implementing a compiler, could be really useful. And lastly, Rust is a language that is gaining popularity and has a growing community, so I feel it is a good choice given the shared nature of this project.

The second choice is about the decision to implement the language from scratch. I want to understand the process of language design and implementation; the goal is not to create a production-ready, feature-rich and performant language; the goal is, instead, to face the challenges someone before me would have faced, and to come up with solutions that I am reasonably happy with.

Flavor

Flavor, the language protagonist of this book.

Flavor will be an interpreted language (unless I manage to write a compiler) offering a strict typing system, type inference, functions and lambdas, and -- possibly -- classes and objects. I will admit that the features Flavor will offer are still to be decided and will greatly depend on my implementation. The few listed above are just the ones I know I want to tackle in this journey.

The aesthetic of Flavor will be designed with consistency in mind -- going back to the decision to use Rust in Some design choices. I have tried a not-so-small collection of languages, and I have found a series of elements that I like and dislike in the syntaxes I have learned. Flavor's syntax will be based on the intersection of those elements.

Roadmap and TODOs

Roadmap

  • Language Development
    • Overview
    • Syntax
    • Semantics
    • Grammar
    • Abstract Syntax Tree
    • Lexing
    • Parsing
    • Typing System
    • ...
  • Flavor Design
  • Implementation
    • Overview
    • Project Structure
    • Lexer
    • Parser
    • TypeChecker
    • ...

TODOs

Language Development

Design

The design of the language is a fun, imaginative step where we get to figure out what Flavor will look like and what the implications will be. I feel that in this step we are doing an exercise in creativity, software engineering, and problem solving.


Introduction to the Design Phase

Having already explored the fundamentals of language development — including syntax, semantics, parsing techniques, type systems, and more — we are now ready to turn our attention to the design phase. This phase is where the theoretical meets the practical: where the abstract concepts and principles previously discussed come together to shape the unique identity of Flavor.

Designing a programming language is both an art and a science. It requires balancing competing goals such as expressiveness, simplicity, performance, and usability.
Every design choice ripples through the language, affecting how users write code, how the compiler or interpreter processes it, and ultimately, how effective the language is at solving real-world problems. But no pressure.

In this section, we will explore the key design decisions that define Flavor. We will examine the rationale behind choosing specific language features, syntax styles, and semantic rules. Many of these decisions will reference concepts covered earlier in the book, such as:

  • How the syntax and grammar influence both the language’s readability and the complexity of the parser.
  • The impact of the type system on safety, flexibility, and developer experience.
  • Trade-offs in memory management strategies and how they affect performance and ease of use.
  • Approaches to error handling that balance robustness with clear, actionable feedback for programmers.
  • The paradigms and programming models that Flavor embraces, informed by our discussion on language paradigms.

Throughout this phase, we encourage a mindset of exploration and iteration. Language design, I find, is rarely linear or final on the first try. Instead, it evolves through experimentation, feedback, and reflection. By understanding the reasoning behind each choice, contributors and users alike will gain insight into the language’s goals and how they shape its implementation.

As we progress, you will see how these design decisions map directly to the codebase covered in the Implementation section, creating a cohesive narrative from concept to concrete realization.

note

I strongly encourage you to experiment! If you are following along with your own version of Flavor and your own implementation, please explore other design choices and their consequences!

Flavor Imagining the First Design

The aim of this chapter is to provide an overview of the design of Flavor. We will not dive into implementation details or choices yet. Rather, we will imagine what we want Flavor to look like.

Disclaimer

This chapter is not intended to present the final syntax and semantics of Flavor. We are simply imagining the language, proposing ideas, and discussing them.

note

I find this imaginative step very useful. If, while designing, we find a particular element ugly or unpleasant to write, why include it? Without this step, identifying such elements would be more difficult. On the other hand, spending time imagining the language is a great way to explore possible features and additions. That said, the last thing I want is for this project to become overwhelmed with too many features. Each proposed feature will need to be explored and discussed to assess its feasibility and rationale. If these discussions produce positive expectations, we will try to implement the feature.

First Syntax Creation

I will leave the following text and code unchanged for the remainder of the book's creation process. For context, I wrote this content without much thought during the initial design phases. My intention in keeping it as is, basically without review, is to show the very first ideas that shaped the language's aesthetics. The consequences of this initial design are discussed in the next section.

In the following code snippet, we explore the syntax for Flavor.

example.flv
// variable declaration with explicit type
let x: int = 10;

// variable declaration with implicit type
let y = 20; // implicitly int

// TODO: check if we want implicit casting
// implicit casting
20 + 20.0 // int + float

print x;

if x > 2 {
    // code...
} else
{
    // code...
}

fn is_even (a: int) -> bool {
    return a % 2 == 0;
}

let lambda: int -> nothing = <x> {
    print x;
};

alias function fn;

function fibonacci (x:int) -> int {
    if x == 1 || x == 2 {
        return 1;
    } else {
        return fibonacci(x-1) + fibonacci(x-2);
    };
}

let fibonacci_lambda = <x:int> -> int { // fibonacci_lambda: <int> -> int
    if x == 1 || x == 2 {
        return 1;
    } else {
        return fibonacci_lambda(x-1) + fibonacci_lambda(x-2);
    };
};

Just by looking at this code, it is clear that many considerations need to be made.

Design Choices and Reasons

When discussing the aesthetic choices behind Flavor, I will focus on consistency.

A language is consistent if its behavior and syntax can be understood without needing to be explicitly taught. For example, if we always use ( ) to enclose parameters, . to access methods and properties, and [ ] to access elements in lists or sets, then we naturally expect to access the second element of a tuple using tuple[1], not tuple(1) or tuple.1. This is what I mean by consistency in a programming language, and all definitions and choices we make should follow this principle.

Moreover, the syntax should be fluent and enjoyable to write. To provide a practical example, consider these equivalent snippets from various languages:

java
System.out.println("Hello World!");
javascript
console.log("Hello World!");
rust
println!("Hello World!");
lua
print "Hello World!"

All these snippets perform the same task: a simple console print. As you can see, there are both simpler and more complex ways of writing this instruction.

Is there a best way to write the print command?
I cannot say that one is absolutely better than the others; however, I do have personal preferences. I like using the ! to indicate a macro in the Rust example (println!("...");), because it immediately clarifies what is being used (macro vs function call). But I also find the simplicity of Lua (print "...") very appealing.
I'm confident that for Flavor we will avoid taking inspiration from heavy languages like Java.

Aliasing

An interesting idea that came up while showing the syntax to a friend is the possibility of adding aliases such as alias function fn, possibly even adding multiple aliases at once, like alias [function, func] fn. This would allow users to customize the language themselves. I have not yet considered the consequences and implementation challenges of such a system. We can postpone these considerations until after we have a basic implementation.

Design Ideas Wrap Up

This step was intentionally exploratory; the idea was just to generate ideas. As of now, the only things I know I want to incorporate are:

  • some basic typing notation
  • variable declaration
  • functions and lambdas notation

Hopefully, that is enough to start having fun with development. In the next chapter, we will explore the first step of the implementation.

Language Features and Style

In designing a programming language, seemingly small choices—such as whether to use semicolons or parentheses in certain contexts—can have significant implications on the language’s readability, expressiveness, and user experience. These choices shape not only how code looks but also how it feels to write and maintain it.

In this chapter, we analyze the implications of such decisions and explore their impact on the language’s style and usability. We will begin by examining the use of semicolons and parentheses, then expand to other stylistic and syntactic choices that influence the overall flavor of the language.

Next, we will examine design choices related to implementation aspects, such as memory management, scopes, and more.

note

If you have a different perspective or think I missed something in any of the following points, please create an issue or contribute directly on GitHub.

Semicolons: To Use or Not to Use?

The semicolon (;) is a classic statement terminator in many languages such as C, Java, and Rust. Its presence can convey important semantic information and affect how code is parsed and understood.

Implications of Using Semicolons

  • Encourages Deliberate Writing: Like a period in a sentence, semicolons encourage the writer to consider the end of an action or intent.
  • Clear Statement Boundaries: Semicolons explicitly mark the end of a statement, helping both the compiler and the programmer to distinguish one instruction from another.
  • Facilitates Multiple Statements Per Line: Semicolons allow multiple statements on a single line, which can be convenient but may also reduce readability if overused.
  • Parsing Simplicity: From an implementation standpoint, semicolons can simplify parsing by clearly delimiting statements.

Implications of Omitting Semicolons

  • Cleaner Syntax: Languages like Python and Lua omit semicolons by default, resulting in cleaner, less cluttered code.
  • Reliance on Line Breaks: The parser must rely on line breaks or indentation to determine statement boundaries, which can sometimes lead to ambiguity or subtle bugs.
  • Potential for Ambiguity: Without explicit terminators, certain constructs might become harder to parse or require additional rules.

Choosing a Semicolon Policy for Flavor

For Flavor, using semicolons as statement terminators strikes a balance between clarity and expressiveness. It signals the end of an intention clearly while allowing flexibility in formatting. However, we may consider omitting semicolons in certain contexts (like the last statement in a block) to reduce verbosity—a feature common in modern languages.

Parentheses: When Are They Necessary?

Parentheses are traditionally used to enclose function parameters, but their usage can vary widely across languages.

Functions and Parentheses

  • Clarity of Invocation: Parentheses clearly distinguish function calls from variable references or other expressions.
  • Parameter Grouping: They group parameters and help the parser understand argument boundaries.
  • Optional Parentheses: Some languages (e.g., Lua) allow omitting parentheses for function calls with a single string or table argument, making code more concise.

Statements Without Parentheses

Consider the print statement in languages like Lua, where parentheses are optional or omitted entirely. This can make simple statements feel more natural and less cluttered. Also, ensuring the absence of parentheses helps differentiate function calls from statements.

parentheses.flv
// function call
fibonacci(10);

// statement
print "Hello World!";

This approach allows a statement to accept only a single expression as an argument. However, a problem arises when the statement requires multiple arguments1.

For example, consider a dummy statement foo that needs two arguments. We can pass a tuple to preserve the "single expression as argument" rule for statements, resulting in the syntax foo (arg1, arg2). Note that this is not a function call, but a statement using a tuple as its parameter.

The syntax does not clearly distinguish between these two cases in such scenarios.
Should we keep this syntax as it is? Or should we adopt an explicit notation to clearly differentiate these elements, as Rust does with macros and the exclamation mark (print!(...))?

Implications for Flavor

In Flavor, we propose requiring parentheses for function calls to maintain clarity and consistency, especially as functions become more complex. However, for certain built-in statements or commands like print, we may allow parentheses to be optional or omitted, enhancing code readability and fluency.

note

Currently, parentheses are expected only for function calls, not for statements.

Type Definitions: Style, Readability, and Power

When it comes to defining types in a programming language, the choices we make have a big impact on how the language feels to write and read. It’s not just about correctness or functionality — it’s about aesthetics, expressiveness, and the balance between simplicity and control.

Aesthetics: What Should Type Definitions Look Like?

Imagine you’re writing code and need to specify the type of a variable or function parameter. How do you want that to look?

  • Should it be verbose and explicit, like int, float, or string — simple, familiar words that almost feel like natural language?
  • Or should it be more precise and low-level, like i32, f64, or usize, which convey exact size and behavior but might feel more intimidating or technical?
  • Maybe a mix of both, where you can use simple names for everyday cases but also dive into detailed type specifications when you want?

For example, consider these two snippets:

Typing Example
let x: number = 10;
let y: int = 10;
let z: i32 = 10;

All three declare an integer variable: int feels approachable and human-friendly, while i32 gives you exact control over the integer’s size and representation. We could also use more general names like number, but this would require automatically converting between different kinds of numbers (such as floats, integers, and doubles) when performing operations.

Naming: Number vs Int, Float vs f32, f64, etc.

The choice of type names influences readability and clarity:

  • Simple names like number, int, float are easy to understand and remember — great for beginners and quick coding.
  • Detailed names like i32, u64, f64 provide precision, which is crucial when performance, memory layout, or interoperability matter.

Some languages (like Python or JavaScript) lean heavily on simple, abstract type names, hiding low-level details. Others (like Rust or C) expose those details upfront.

Learning from Other Languages

  • Rust uses explicit low-level types like i32 and f64, which can be daunting for beginners but powerful for systems programming.
  • Go uses simple names like int and float64, balancing ease of use with precision.
  • TypeScript offers a rich type system with friendly names and advanced features, making it approachable yet expressive.
  • Python largely hides type details but has recently introduced optional typing with simple syntax to improve readability.

Function Types

Functions are a core building block in any programming language, and how we define their types directly affects both readability and expressiveness. In Flavor, we want function type definitions to be clear, approachable, and consistent with the overall style of the language.

A typical function definition in Flavor looks like this:

flavor
fn foo(a: int, b: int) -> int {
  return a + b;
}

With this notation, the user is able to statically describe the return type of the function as well as the types of its parameters. This will allow us to implement static checks and to provide rich error messages to the user.

Wrapping Up

Type definitions are more than just syntax — they set the tone for how approachable or powerful the language feels.

In the next sections, we’ll explore how these type definitions tie into the language’s semantics and implementation, and how they influence error handling, inference, and more.


Implementation Choices: Memory Management, Scoping, and Error Handling

Now that we have explored the language’s syntax and design style, it’s time to consider some of the fundamental decisions that affect how Flavor will actually work under the hood. Since Flavor is intended to be an interpreted language (in its first incarnation at least), these choices shape the interpreter’s behavior and the programmer’s experience.

This section does not dive into code specifics but focuses on the concepts, trade-offs, and implications of different implementation strategies. The goal is to give you a clear understanding of the possibilities so you can confidently explore or even implement any of them yourself.

Memory Management

How a language manages memory is one of the most important design and implementation decisions. It affects performance, safety, ease of use, and complexity.

note

Explicit memory management policies may not be necessary in interpreted languages. However, since Flavor is designed as a learning platform rather than a practical programming language, we will include these policies in some form.

Manual Memory Management

  • The programmer is responsible for allocating and freeing memory explicitly.
  • Offers fine-grained control and potentially very efficient memory use.
  • However, it is error-prone: mistakes can cause memory leaks, dangling pointers, or crashes.
  • Languages like C use this model.

Garbage Collection (GC)

  • The interpreter automatically tracks and frees memory that is no longer in use.
  • Removes the burden of manual memory management from the programmer, reducing certain classes of bugs.
  • GC can introduce pauses or overhead, which might affect performance or responsiveness.
  • Languages like Java, JavaScript, and Python use GC.

There are different GC strategies:

  • Tracing GC: Periodically scans memory to find unreachable objects.
  • Reference Counting: Keeps counters for references and frees objects when the count reaches zero (see the sketch below).
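
Reference counting is easy to see in action: Rust itself ships a reference-counted pointer, std::rc::Rc, whose counter we can inspect directly. A minimal demonstration:

rust
use std::rc::Rc;

fn main() {
    let data = Rc::new(String::from("shared"));
    println!("{}", Rc::strong_count(&data)); // 1: only `data` points to the string

    let alias = Rc::clone(&data); // cheap: only the counter is incremented
    println!("{}", Rc::strong_count(&alias)); // 2

    drop(alias); // counter goes back to 1
    println!("{}", Rc::strong_count(&data)); // 1
} // counter reaches 0 here and the string is freed

The caveat: if two reference-counted values point at each other, their counts never reach zero and the memory is never freed. This is the cyclical reference problem discussed below.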

Stack Allocation

  • Some values (like local variables) can be allocated on a stack with automatic cleanup when scope ends.
  • This is fast and simple but limited to values with clear lifetimes.
  • Often combined with GC or manual management for other cases.

Implications for Flavor

Choosing garbage collection (GC) simplifies programming for the user and fits well with the flexibility of an interpreted language. Among GC techniques, reference counting is easier to implement but struggles with cyclical references.
Manual memory management offers powerful control but significantly increases complexity and risk on the user's end, making it less suitable for a language aimed at learning. However, it is a fairly easy system to implement in its most basic form.
Meanwhile, stack allocation can improve performance for local variables but requires careful tracking of variable lifetimes to avoid errors.

Scoping

Scoping rules determine how and where variables and functions are visible and accessible, shaping both language semantics and implementation complexity.

Lexical (Static) Scoping

  • The scope of variables is determined by their position in the source code.
  • Most modern languages use lexical scoping.
  • Enables predictable bindings and supports closures (functions capturing variables from their defining environment).
  • Easier to reason about and implement in an interpreter.

Dynamic Scoping

  • Variable bindings are resolved by the call stack at runtime.
  • Can be more flexible but harder to predict and debug.
  • Rarely used in modern mainstream languages.

Nested Scopes and Closures

  • Supporting nested scopes allows functions to be defined inside other functions, capturing variables from outer scopes.
  • Closures are powerful for abstraction and functional programming styles.
  • Implementing closures requires the interpreter to maintain environments that outlive their original call frames, as the sketch below illustrates.
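
Rust's own closures give a concrete picture of an environment outliving its call frame. In this sketch, count survives after make_counter has returned, because the returned closure captured it:

rust
// The returned closure owns `count`, so the variable outlives
// the call frame of `make_counter`.
fn make_counter() -> impl FnMut() -> i32 {
    let mut count = 0;
    move || {
        count += 1;
        count
    }
}

fn main() {
    let mut next = make_counter();
    println!("{}", next()); // 1
    println!("{}", next()); // 2
}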

Implications for Flavor

  1. Flavor adopts lexical (static) scoping, meaning that the visibility of variables is determined by their position in the source code rather than the call stack at runtime. This ensures that variables declared inside a block are only accessible within that block and any nested inner blocks.
  2. This scoping model enforces clear boundaries for variable visibility, preventing accidental access to variables that are out of scope. For example, the following code produces an error because y is used outside its declaring block:
Scoping Example
let x = 4;

if x > 2 {
  let y = 5;
  y;
}

print y; // Error: y is not defined here
  3. By making variable lifetimes explicit and predictable, lexical scoping improves both the language’s design and the programmer’s experience. Errors caused by undefined or shadowed variables can be detected early, helping users write safer and more maintainable code.
  4. Nested scopes and variable shadowing are naturally supported. Inner blocks can declare variables with the same name as outer ones, temporarily overriding the outer binding within that inner scope. This adds flexibility while maintaining clarity.

Error Handling

How a language handles errors influences code robustness and developer experience.

Compile-Time vs Runtime Errors

  • Compile-time errors catch problems before the program runs, improving safety but requiring more upfront checks.
  • Runtime errors occur during execution and must be handled gracefully to avoid crashes.

Exception Handling

  • Languages often provide constructs like try/catch to handle exceptions.
  • Enables separating error-handling code from normal logic but can complicate control flow.

Result Types and Explicit Handling

  • Some languages (like Rust) use types like Result to represent success or failure explicitly.
  • Encourages handling errors explicitly but can add verbosity (see the example below).
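
We will rely on exactly this style in the Rust code of the Implementation section. As a quick reminder of what explicit Result handling looks like:

rust
// The failure case is part of the signature, so callers cannot ignore it.
fn divide(a: f64, b: f64) -> Result<f64, String> {
    if b == 0.0 {
        Err("division by zero".to_string())
    } else {
        Ok(a / b)
    }
}

fn main() {
    match divide(10.0, 0.0) {
        Ok(value) => println!("quotient: {value}"),
        Err(message) => println!("error: {message}"),
    }
}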

Implications for Flavor

TODO...

Additional Style Decisions

Beyond semicolons and parentheses, other stylistic choices also shape the language’s character:

1. Block Delimiters

  • Using braces {} to delimit blocks is familiar and visually clear.
  • Alternatives include indentation-based blocks (like Python), which reduce punctuation but complicate parsing.

Summary

Small syntactic and stylistic choices have outsized effects on the language’s usability and identity. By carefully considering semicolon usage, parentheses, block delimiters, keywords, operators, and type annotations, we hope to better streamline the implementation process.

Each of these implementation choices—memory management, scoping, and error handling—comes with trade-offs that affect performance, safety, ease of use, and complexity. There is no one-size-fits-all solution, and part of the learning journey is understanding these trade-offs deeply.

By exploring these options thoroughly, anyone interested in programming language development can choose the path that best fits their goals, whether for Flavor or any other language project.

In the following chapters, we will explore these decisions in more detail and see how they influence both the language’s design and its implementation.


1: Calling them arguments is improper, but the term provides a good intuitive understanding of the concept.

Implementation

To start actually implementing the language, we must first discuss the elements that usually compose a language engine.
This section will focus on a generic overview; the details of the Flavor implementation are discussed in the dedicated chapters (e.g. Lexer).

A language's engine is usually divided into three sections:

  1. Lexical analysis
  2. Parsing
  3. Interpretation or Compilation

warning

These components, and the way they are described here, are just an introduction, certainly not complete. There are multiple ways one could go about creating a programming language.

From a high-level perspective, we can think of these steps as follows:

  1. Formalizing the content of the source code in a structure we can work with
  2. Checking if the source code follows the grammar rules we defined and, if so, producing a custom representation of the code to encode important information
  3. Executing the logic described in the source code in a virtual environment (interpretation) or translating the source code into lower level code such as assembly (compilation)

We will go deeper into these topics as they become relevant in our work, but for now let's focus on the lexical analysis performed by the lexer.

The Lexer

As stated, the lexer is responsible for the formalization of the source code into a structure we can better work with programmatically. The usual choice for this structure is an array of tokens.

So what is a token?
We can define a token as a tuple containing an identifier and the source code that generated it. The code 3 + 4, once analyzed by the lexer, will produce the following token list:

tokens
token_list = [
    (NUMBER, "3"),
    (BIN_OP, "+"),
    (NUMBER, "4"),
]

Naturally, we must have defined the tokens beforehand to use them as shown. This is where our design choices come into play for the first time: suppose we want to define a language that does not use the symbol ^; that is fair, and it means that we will not define the token corresponding to that symbol.
This concept extends to more complex elements.

Let us elaborate on this example further so that we can reference it later in this section in a more complete form. To do so we will use the more complex expression 3 + 4 * 5.

Lexing example
let code = "3 + 4 * 5";

let tokens = Lexer.lexe(code);

/* Tokens
(NUMBER, "3"),
(BIN_OP, "+"),
(NUMBER, "4"),
(BIN_OP, "*"),
(NUMBER, "5"),
*/

How the lexer performs this processing of the source code is not that relevant; there are multiple ways in which we can accomplish this result.
The most common is a single-character scanning system. The logic is that the lexer scans the source code one character at a time and checks whether it corresponds to a token or to the start of one. For example, if the character is the symbol ;, we produce the token SEMICOLON. If the character is the letter f, it might be part of an identifier; in this case, the lexer continues scanning the following characters (looking for alphanumeric ones) and, once finished, produces the token IDENTIFIER.

The Parser

The parser is the next component we employ for the processing of the source code.

Let us get an intuitive understanding of the parser's purpose. The goal is to construct what is known as an Abstract Syntax Tree (AST). An AST is a tree-like structure which encodes the tokens of the token list (generated by the lexer) with added information (for example, priority in the operations).
Keeping the example from before (3 + 4 * 5) we will expect the corresponding AST to be as follows:

AST
    (+)
    / \
   /   \
 [3]   (*)
       / \
     [4] [5]

What is the meaning of this structure?
Imagine exploring this tree from the root downwards.
First we encounter the addition node +; we know that the two operands of the addition are the child nodes of the +. We then go to the left child, finding the number 3; 3 is a terminal element, so we are done and we know the first of the two operands. Moving to the right child, we find the operator * for the multiplication. As already shown for the addition, we look for the operands of the multiplication in the child nodes. We find two terminal nodes, meaning that we know how to resolve the * operation (4 * 5 = 20). We now have the second operand for the addition (20); summing the two operands, we obtain 23, correctly interpreting the tree.
We have done two things: first, congratulations, we have successfully simulated an interpreter; second, we have shown the importance of the precedence of operations that a tree encodes (we respect the arithmetical rule of resolving multiplications before additions).
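
The walk we just performed by hand is exactly a recursive function over the tree. Here is a toy version in Rust (a sketch only, unrelated to the ASTNode type we will define for Flavor later):

rust
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Evaluate a node by first evaluating its children: a tiny tree-walking interpreter.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(l, r) => eval(l) + eval(r),
        Expr::Mul(l, r) => eval(l) * eval(r),
    }
}

fn main() {
    // The AST for 3 + 4 * 5 drawn above
    let ast = Expr::Add(
        Box::new(Expr::Num(3)),
        Box::new(Expr::Mul(Box::new(Expr::Num(4)), Box::new(Expr::Num(5)))),
    );
    assert_eq!(eval(&ast), 23);
}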

Let us now understand how to build the parser so that it will produce the AST from the token list. We need to give meaning to the tokens; to do so, we need a grammar. What the grammar does in this context is provide the rules used to check the correctness of the token ordering.
This step is important, so let us reiterate.

We use the grammar to specify how the tokens must appear in the token list. The simple grammar expr := NUMBER '+' NUMBER will accept the tokens [(NUMBER, "3"), (BIN_OP, "+"), (NUMBER, "4")], in turn accepting the code 3+4. Everything other than these specific elements will be rejected by the grammar.
The parser then uses the grammar to check the correct ordering of the tokens. If the tokens are in a valid sequence, then the corresponding AST node is produced (in this case the addition node (+) with the two number children [3] [4]).
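
A sketch of what "checking the tokens against a rule" could look like for this one grammar rule (hypothetical token shapes, not Flavor's real parser, which we build in the next chapters):

rust
// Accepts exactly the token sequence NUMBER '+' NUMBER and produces
// a textual description of the resulting AST node.
fn parse_addition(tokens: &[(&str, &str)]) -> Result<String, String> {
    match tokens {
        [(k1, a), (k2, op), (k3, b)]
            if *k1 == "NUMBER" && *k2 == "BIN_OP" && *op == "+" && *k3 == "NUMBER" =>
        {
            Ok(format!("(+) with children [{a}] [{b}]"))
        }
        _ => Err("tokens do not match expr := NUMBER '+' NUMBER".to_string()),
    }
}

fn main() {
    let tokens = [("NUMBER", "3"), ("BIN_OP", "+"), ("NUMBER", "4")];
    println!("{:?}", parse_addition(&tokens)); // Ok("(+) with children [3] [4]")
}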

Lexer

Going back to the introduction to this chapter (Implementation), we reviewed the intended behavior of the lexer, not its specific implementation for Flavor.
Here, in these sections, we will analyze the code behind Flavor and the reasoning that produced it.

In this chapter, we will create a lexer that uses regular expression (regex) rules.
The idea behind this lexer is to define a set of regex patterns that represent the tokens we want to identify.
When a regex matches a part of the code, we generate the corresponding token and add it to the token list.
For example, we can define the regex [0-9]+ to find numbers in the source code.
This regex is then tested against the source code, and if it matches a string like
146, the token (NUMBER, "146") is generated.

Setup

The creation of a lexer starts with the definition of the allowed syntax. In the case of Flavor, this step was done in the design phase.
Using the code as we imagined it, we can define this allowed syntax. For instance, notice the usage of the keyword let, the semicolon ;, and the colon :. All these elements will correspond to tokens in the lexer. We can repeat the process to compile a complete list of tokens in an enum. We will also define the struct Token to represent the tuple for the token (as described in Implementation).

Development

First things first: we decided to use Rust for the project, and we will follow a structured approach to development: first comes the definition of the types, then their behaviors, and lastly their usage in the main code.

The type definitions for the lexer are straightforward; we need the allowed tokens and a struct to represent the concept of a token.

The code necessary to do so is found in types.

Let us first define the TokenNames to check for in the lexer.

types.rs
pub enum TokenName {
    // Keywords
    Let, Fn, Alias,

    // Symbols
    Colon, Semicolon, // ...

    // Parentheses
    LPar, RPar, // ...

    // Complex Elements
    Number, Identifier, // ...

    // Utils
    Unknown, Eof,
}

We can then specify the Token structure.

types.rs
// ... Above

pub struct Token {
    pub tok_name: TokenName,
    pub lexeme: String,
}

The usage of a struct to define the tuple for the token may seem overkill. That is because, at this stage, it is.
This is simply forward thinking, to allow for the inclusion of more complex elements such as the column, row, and length of the lexeme. Those elements can be useful to provide nice errors to the user.

note

I've never implemented an error system for a language. This means that I am assuming the need for those more complex elements in the structure.
I could be mistaken.
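
For illustration, here is what such a future-proofed Token could look like. The extra fields are hypothetical; they are not part of the current implementation:

Token sketch
// A possible richer Token (hypothetical fields, not in Flavor yet).
pub struct Token {
    pub tok_name: TokenName, // the TokenName enum defined above
    pub lexeme: String,
    pub row: usize,    // line where the lexeme starts
    pub col: usize,    // column where the lexeme starts
    pub length: usize, // length of the lexeme, e.g. for underlining in error messages
}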

Once the definition of the useful types is done, we can start implementing the logic.
We will create a new file called lexer.rs (lexer file) in which we will define the lexer struct and its behavior.

To start let us define the structure for the lexer itself.

lexer.rs
use regex::Regex;

pub struct Lexer {
    pub tokens: Vec<Token>,
    pos: usize,
    source: String,
}

We leave the tokens public to access them from outside the lexer itself. The source attribute stores the source code to process. Lastly, the pos element represents the current starting position in the source code; this means that when a snippet is consumed by the lexer (see the next part to understand how), we advance the position (pos) to indicate a new starting point for the lexer.

The workings of the lexer are as follows:

lexer.rs
// Above

impl Lexer {
    pub fn new(source_code: &str) -> Self {
        Lexer {
            tokens: vec![],
            pos: 0,
            source: source_code.to_string(),
        }
    }

    pub fn lexe(&mut self) {
        loop {
            let tok = self.next_token();

            self.tokens.push(tok.clone());
            if tok.tok_name == TokenName::Eof {
                break;
            }
        }
    }

    fn next_token(&mut self) -> Token {
        self.skip_whitespace();

        if self.pos >= self.source.len() {
            return Token {
                tok_name: TokenName::Eof,
                lexeme: "\0".to_string(),
            };
        }

        let mut tok = Token {
            tok_name: TokenName::Unknown,
            lexeme: "".to_string(),
        };

        let mut length_of_tok: usize = 0;

        let patterns = [
            (r"let", TokenName::Let),
            (r"fn", TokenName::Fn),
            (r"alias", TokenName::Alias),
            (r":", TokenName::Colon),
            (r";", TokenName::Semicolon),
            (r"==", TokenName::Eq),
            (r"\!=", TokenName::NotEq),
            (r"\=", TokenName::Assign),
            (r"\!", TokenName::Not),
            // ...
            (r"[0-9]+", TokenName::Number),
            (r"[a-zA-Z_][a-zA-Z0-9_]*", TokenName::Identifier),
            (r"[\s\S]*", TokenName::Unknown),
        ];

        for (pattern, token_name) in patterns.iter() {
            if let Some(lexeme) = self.match_start(pattern) {
                tok.tok_name = token_name.clone();
                tok.lexeme = lexeme.to_string();
                length_of_tok = lexeme.len();
                break;
            }
        }

        self.consume_n_char(length_of_tok);

        tok
    }

    fn match_start(&self, pattern: &str) -> Option<&str> {
        let re = Regex::new(pattern).unwrap();
        if let Some(mat) = re.find(self.remaining_source()) {
            if mat.start() == 0 {
                Some(mat.as_str())
            } else {
                None
            }
        } else {
            None
        }
    }

    fn skip_whitespace(&mut self) {
        let re = Regex::new(r"^\s+").unwrap();
        while let Some(m) = re.find(self.remaining_source()) {
            if m.start() == 0 {
                self.consume_n_char(m.end());
            } else {
                break;
            }
        }
    }

    fn remaining_source(&self) -> &str {
        &self.source[self.pos..]
    }

    fn consume_n_char(&mut self, n: usize) {
        self.pos += n;
    }
}

Let us break this code down.
The core of the logic is the next_token() method, which is responsible for producing the token. The logic is simple: the method checks the source code against known regex rules; if one matches, then the corresponding token is produced.
To do so, the method uses an array of tuples (regex, token_name); a for loop then iterates over those tuples and checks them against the source code.
The first that matches the source is used to produce the token, the loop breaks, and the start of the source code is shifted by however many characters were matched.

important

The sequence in which we provide the regex rules is critical for the correct operation of the lexer. If we were to check the regex for the assignment (r"\=") before the regex for the equality (r"=="), we would never be able to produce the token Eq, obtaining two Assign tokens instead.
I find that this can be considered a weakness of this kind of lexer.

Surrounding this core logic we find a small collection of methods that follow the separation of concerns principle. Those methods are used to:

  • ignore whitespace, tabs, and other junk characters (skip_whitespace())
  • consume the matched characters in the source code, advancing the starting position (consume_n_char())
  • match the source code against a regex rule (match_start())
  • get the remaining source code to process (remaining_source())

Lastly, to keep the code clean, we create another method called lexe() to serve as an entry point for the usage of the lexer. The body of the lexe() method consists only of iterative calls to the next_token() method until the end of file. The tokens obtained in this manner are collected into the tokens property of the lexer, ready to be accessed.

At this stage we can put together a simple main() (main file) for the project to test the correctness of the lexer.

main.rs
fn main() {
    let code = r#"let
    fn
    alias

    foo
    47

    : ; = == != ! -> => > < >= <=
    + - * / % ( ) [ ] { }

    let foo = 3;"#;

    let mut lexer = Lexer::new(code);
    lexer.lexe();

    lexer.tokens.iter().for_each(|tok| {
        println!("{:?}", tok);
    });
}

Running this code will provide the following output:

output of the main execution
Token { tok_name: Let, lexeme: "let" }
Token { tok_name: Fn, lexeme: "fn" }
Token { tok_name: Alias, lexeme: "alias" }
Token { tok_name: Identifier, lexeme: "foo" }
Token { tok_name: Number, lexeme: "47" }
Token { tok_name: Colon, lexeme: ":" }
Token { tok_name: Semicolon, lexeme: ";" }
Token { tok_name: Assign, lexeme: "=" }
Token { tok_name: Eq, lexeme: "==" }
Token { tok_name: NotEq, lexeme: "!=" }
Token { tok_name: Not, lexeme: "!" }
Token { tok_name: SlimArrow, lexeme: "->" }
Token { tok_name: BoldArrow, lexeme: "=>" }
Token { tok_name: Gt, lexeme: ">" }
Token { tok_name: Lt, lexeme: "<" }
Token { tok_name: Ge, lexeme: ">=" }
Token { tok_name: Le, lexeme: "<=" }
Token { tok_name: Plus, lexeme: "+" }
Token { tok_name: Minus, lexeme: "-" }
Token { tok_name: Times, lexeme: "*" }
Token { tok_name: Div, lexeme: "/" }
Token { tok_name: Percent, lexeme: "%" }
Token { tok_name: LPar, lexeme: "(" }
Token { tok_name: RPar, lexeme: ")" }
Token { tok_name: LSqu, lexeme: "[" }
Token { tok_name: RSqu, lexeme: "]" }
Token { tok_name: LBra, lexeme: "{" }
Token { tok_name: Rbra, lexeme: "}" }
Token { tok_name: Let, lexeme: "let" }
Token { tok_name: Identifier, lexeme: "foo" }
Token { tok_name: Assign, lexeme: "=" }
Token { tok_name: Number, lexeme: "3" }
Token { tok_name: Semicolon, lexeme: ";" }
Token { tok_name: Eof, lexeme: "\0" }

Indeed confirming the correctness of the lexing step.

Parser

In this chapter, we describe the creation of the parser to produce the ASTs from the token list.

Setup

Let us discuss the possible implementation choices for a parser. There are multiple routes we could take; the standard strategy involves the use of a library for parser creation (yacc is a good example). To stay true to the project philosophy, however, we will implement everything from scratch.
Let us imagine the solution to this problem.

We are working at a high level of abstraction at this stage.
The most basic behavior that the parser must incorporate is the capability to check tokens to assess the correctness of the order in which they appear. We can logically conclude that we will need a method like expect_token(expected: token). Assuming the presence of such a method, we can then imagine how to use it. To check the tokens involved in the let definition, for instance, we can chain multiple calls of the expect_token method. This logic will live in its own method; let's call it parse_let(). This is where the problem arises.
The variable's value, in the let definition, can be an expression, which is itself composed of multiple tokens. This means that we will probably need a dedicated method to parse expressions (parse_expression()). But then again, an expression is a recursive definition (we can have the expression 3+3+3+3+..., meaning that while parsing the operand of an operation we can encounter the operation itself as a child: recursion).
That is basically it: we just need to specify the precedence in which we want to evaluate the operations, and then recursively call the parsing functions accordingly.
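
To visualize this recursion, here is an informal sketch of such a grammar (illustrative EBNF-style notation, not Flavor's final grammar). Notice how expression reappears inside its own definition through term and factor:

grammar sketch
statement  := let_stmt | expr_stmt
let_stmt   := 'let' IDENTIFIER (':' type)? '=' expression ';'
expr_stmt  := expression ';'
expression := term (('+' | '-') term)*
term       := factor (('*' | '/') factor)*
factor     := NUMBER | IDENTIFIER | '(' expression ')'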

important

This kind of parser is called Recursive Descent Parser

An important specification relates to the imperative structure of Flavor.
In Flavor code we can define a sequence of instructions (statements separated by semicolons). This structure is reflected in the implementation of the parser, which will produce a vector of ASTs (one per statement).

Development

Disclaimer
Implementing a parser is challenging because it involves understanding and handling many elements and concepts at the same time. When building the parser from scratch, we will add each part step by step until the system is complete.
In this chapter, we will present the implementation and explanations of the key aspects, often including comments that guide you through the process as if we were building the parser together.

As we have already done for the lexer, we will start by defining the necessary types to then use in the parser (the code is again found in types).

More precisely, we are going to define the AST nodes first.

types.rs
// ... Above
pub enum ASTNode {
    Body {
        nodes: Vec<ASTNode>,
    },
    If {
        guard: Box<ASTNode>,
        then_body: Box<ASTNode>,
        else_body: Option<Box<ASTNode>>,
    },
    While {
        guard: Box<ASTNode>,
        body: Box<ASTNode>,
    },
    LetDeclaration {
        identifier: String,
        var_type: Option<Type>,
        expr: Box<ASTNode>,
    },
    FunctionDeclaration {
        name: String,
        parameters: Vec<(String, Type)>,
        return_type: Type,
        body: Box<ASTNode>,
    },
    Return(Box<ASTNode>),
    Break,
    FunctionCall {
        callee: Box<ASTNode>,
        arguments: Vec<ASTNode>,
    },
    UnitLiteral,
    NumberLiteral(String),
    StringLiteral(String),
    BoolLiteral(String),
    Identifier(String),
    ArrayAccess {
        array: Box<ASTNode>,
        index: Box<ASTNode>,
    },
    BinaryExpression {
        left: Box<ASTNode>,
        operator: String,
        right: Box<ASTNode>,
    },
    UnaryExpression {
        operator: String,
        operand: Box<ASTNode>,
        is_postfix: bool,
    },
    ExpressionStatement(Box<ASTNode>),
}

I understand that this collection of AST nodes might seem random or complicated. This is because, as mentioned earlier, this chapter presents the parser in its final state. However, I did not implement everything all at once. This is a limitation of the book format.

Let us gather some key insights from this enum of ASTNodes. As discussed in Language Development, we need to specify the nodes with which to represent the elements present in Flavor. We will break down the ASTNode enum by analyzing some examples. The literals get their own specific nodes (Number -> NumberLiteral; True, False -> BoolLiteral; etc.). The literals can be used as operands in operations, both binary and unary (BinaryExpression, UnaryExpression). Then come the statements, which compose expressions into larger constructs (If, While, LetDeclaration, and so on).

From this definition alone the recursive nature of the parser structure is apparent. Notice how the expression (expr) in the LetDeclaration node is itself of type ASTNode.

We can then define the parser as the following struct (code found in parser):

parser.rs
pub struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

We just need the token list to check and the current position we are checking.

The interesting part follows.
In the book we will report just a part of the parser, highlighting the most important steps and elements. The entire code is found on the project's GitHub (entirely open source, feel free to contribute).
We will define the entry point for the parser as a public method, together with some helper functions.

parser.rs
// ... Above

impl Parser {
    // Helper
    fn current_tok(&self) -> &Token {
        &self.tokens[self.pos]
    }

    // Helper
    fn consume_tok(&mut self) {
        if self.pos < self.tokens.len() - 1 {
            self.pos += 1;
        }
    }

    // Helper
    fn expect_tok(&mut self, expected: TokenName) -> Result<Token, String> {
        let tok = self.current_tok();

        if tok.tok_name == expected {
            let tok = tok.clone();
            self.consume_tok();
            Ok(tok)
        } else {
            Err(format!(
                "Expected token {:?}, found {:?} ('{}')",
                expected, tok.tok_name, tok.lexeme
            ))
        }
    }

    // Entry Point
    pub fn parse_program(&mut self) -> Result<Vec<ASTNode>, String> {
        let mut nodes = Vec::new();
        while self.current_tok().tok_name != TokenName::Eof {
            nodes.push(self.parse_statement()?);
        }
        Ok(nodes)
    }
}

These helper methods allow us to get the current token, consume it (shift the starting point of the list one position to the right), and check the current token against a given TokenName.

Lastly, the parse_program() method will serve as an entry point for the parser. Notice the signature of the method which will return the vector of ASTs.

Now for the fun stuff: we need a collection of parse functions to handle the different elements of the grammar.
First off, the parse_statement() method, which is responsible for dispatching the parsing of all the different statements we support in Flavor.

parser.rs
fn parse_statement(&mut self) -> ParseProduction {
    match self.current_tok().tok_name {
        TokenName::Let => self.parse_let_statement(),
        // TODO: aliasing, fn declaration, return statement, ecc...
        _ => self.parse_expression_statement(),
    }
}

Currently we support only a few statements: the let declaration is the first element in the list, and it will be followed by all the other statements. The default branch is parse_expression_statement(), which is responsible for handling a statement such as 3+4;.

Why do we need such statements without side effects? What are side effects? What is the difference between a statement, an expression, and an expression-statement? These are all topics discussed in chapter TODO.

Let us now analyze the method for the parsing of let declarations (parse_let_statement()). This will serve as an example to understand how we can apply the grammar in the parser to check the tokens. We can describe the let statement as follows: let identifier (':' identifier)? '=' <expr> ';'. This means that we expect, in order:

  1. the let token
  2. the identifier token
  3. optionally, the colon followed by another identifier
  4. the equal sign '=' token
  5. a collection of tokens composing an expression
  6. the semicolon ; token

To achieve this pattern of checks we implement the following code:

parser.rs
fn parse_let_statement(&mut self) -> Result<ASTNode, String> {
    self.expect_tok(TokenName::Let)?;
    let id_tok = self.expect_tok(TokenName::Identifier)?;

    let var_type: Option<Type> = if self.current_tok().tok_name == TokenName::Colon {
        self.consume_tok();
        Some(self.parse_type()?)
    } else {
        None
    };

    self.expect_tok(TokenName::Assign)?;
    let expr = self.parse_expression()?;
    self.expect_tok(TokenName::Semicolon)?;

    // Return of the AST
    Ok(ASTNode::LetDeclaration {
        identifier: id_tok.lexeme,
        var_type,
        expr: Box::new(expr),
    })
}

Notice how the required tokens are checked in sequence. The ? operator works thanks to the return type of the parse functions: each returns a Result, and if an error is present, the ? operator propagates it to the caller.

For those of you reimplementing Flavor in other languages, this propagation system can be achieved, for example, with throw and try-catch in Java.
Custom-made solutions are also possible, if not encouraged. We will experiment with a custom error handling and reporting system later in this book.

In this definition, the initialization value is required, while the declaration type is optional. This is shown in the implementation where the presence of the type is checked using an if statement, and the var_type variable is of type Option, meaning it can be None.

important

The expect_tok() method also returns a Result, so that the method returns the token if the check is positive and a structured error message if the check is negative. Having the method return the token allows the caller to use it to compose the node of the AST.

In this example, id_tok stores the return value of expect_tok() after checking that the token is an identifier. The id_tok variable is then used to build the LetDeclaration ASTNode.

The final important element in this example is the call to the parse_expression() method. As we said before, the grammar expects an expression after the = symbol. The implementation represents this with the call to the parse_expression() method.
To make it absolutely clear: the structure of the parse_expression() method will resemble the one in parse_let_statement(), likewise checking the token sequence and calling other parse methods as necessary to compose an AST. The AST obtained from parse_expression() is then used in the LetDeclaration AST node of the example.

Operator Precedence

We have talked about precedence in an informal way. In the context of operations, precedence is a value representing the order in which to execute said operations.

We represent this ordering with the following function.

parser.rs
// ... Above
fn get_precedence(token: &Token) -> Option<u8> {
    match token.tok_name {
        TokenName::Plus | TokenName::Minus => Some(10),
        TokenName::Times | TokenName::Div | TokenName::Percent => Some(20),
        TokenName::Eq | TokenName::NotEq => Some(5),
        _ => None,
    }
}

With this system we can associate a priority value with each operation and then parse those operations accordingly.

The function responsible for the parsing is the following.

parser.rs
fn parse_binary_expression(&mut self, min_prec: u8) -> ParseProduction {
    let mut left = self.parse_postfix_expression()?;

    while let Some(prec) = Self::get_precedence(self.current_tok()) {
        if prec < min_prec {
            break;
        }

        let op_tok = self.current_tok().clone();
        self.consume_tok();

        let right = self.parse_binary_expression(prec + 1)?;
        left = ASTNode::BinaryExpression {
            left: Box::new(left),
            operator: op_tok.lexeme,
            right: Box::new(right),
        }
    }
    Ok(left)
}

The important details in this function are the use of recursion, calling parse_binary_expression() itself for the right operand of the operation, and the use of a precedence check to correctly stop the evaluation of the operations. The recursive call increases the minimum precedence to ensure that the right child of an operation node always has a higher precedence.
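
To see the climbing in action, here is a hand trace over the token stream for 3 + 4 * 5, using the precedence table above and assuming the entry call uses a minimum precedence of 0:

Precedence trace
parse_binary_expression(0)
  left = [3]
  '+' has precedence 10 >= 0 -> consume '+'
    parse_binary_expression(11)
      left = [4]
      '*' has precedence 20 >= 11 -> consume '*'
        parse_binary_expression(21)
          left = [5]
          next token has no precedence -> return [5]
      left = ([4] * [5])
      next token has no precedence -> return ([4] * [5])
  left = ([3] + ([4] * [5]))
  return ([3] + ([4] * [5]))

The result is the same tree we drew back in the Implementation overview: (3 + (4 * 5)).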

We will leave out the remaining parsing methods; just one more will be included in the book: the method responsible for parsing terminal nodes.

parser.rs
fn parse_primary(&mut self) -> ParseProduction {
    let tok = self.current_tok().clone();
    match tok.tok_name {
        TokenName::Number => {
            self.consume_tok();
            Ok(ASTNode::NumberLiteral(tok.lexeme))
        }
        TokenName::Identifier => {
            self.consume_tok();
            Ok(ASTNode::Identifier(tok.lexeme))
        }
        TokenName::LPar => {
            self.consume_tok();
            let expr = self.parse_expression()?;
            self.expect_tok(TokenName::RPar)?;
            Ok(expr)
        }
        _ => Err(format!("Unexpected token in expression: {:?}", tok)),
    }
}

Here the parsing is simpler: the token gets checked, and if it is a terminal token, the corresponding AST node is returned. If, instead, it is an opening parenthesis, we recursively call the parse_expression() method, restarting the cycle.
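
To close the chapter, here is a minimal sketch of how the lexer and parser fit together. Parser::new is assumed here for illustration (the book only shows the Parser struct definition), so adapt this to however you construct yours:

main.rs (sketch)
fn main() {
    let code = "let x = 3 + 4 * 5;";

    // Lexing: source code -> token list
    let mut lexer = Lexer::new(code);
    lexer.lexe();

    // Parsing: token list -> vector of ASTs (one per statement)
    // `Parser::new` is a hypothetical constructor, not shown in the book.
    let mut parser = Parser::new(lexer.tokens);
    match parser.parse_program() {
        Ok(nodes) => println!("parsed {} statement(s)", nodes.len()),
        Err(message) => eprintln!("parse error: {message}"),
    }
}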