I wrote a programming language. Here's how you can, too.

Over the past decade, we've seen an explosion of new programming languages like Go, Rust, Swift, Kotlin, and TypeScript. But have you ever wondered what it takes to create a brand new programming language from scratch?

For the past 6 months, I've been working on a new language called Pinecone. It's been a challenging but incredibly rewarding experience. I've had to learn a lot about programming language theory and stretch my skills in new directions.

In this post, I want to pull back the curtain and walk through the key steps involved in breathing life into a new programming language. I'll share lessons I've learned along the way and point you to resources for digging deeper. My hope is to demystify the process and perhaps inspire you to try your hand at creating a language of your own, even if just as a learning exercise.

Why create your own language?

Before we dive in, you might be wondering why anyone would bother creating a new programming language in the first place. After all, don't we already have enough of them?

While it's true that we probably don't need yet another general-purpose language, there are still plenty of good reasons to create your own:

  1. To really understand how languages work under the hood. Being able to say you've written a language is the ultimate test of whether you grok programming language concepts.

  2. To create a domain-specific language (DSL) for a particular problem space. DSLs can offer a more natural and expressive way to solve problems in areas like data querying, graphics, or scientific computing.

  3. To experiment with new language features or paradigms. Want to add a radical new feature to an existing language? Try prototyping it in a simplified language of your own.

  4. Because it's fun! Language hacking can be an engaging puzzle for the programming language geeks among us.

Now that you're hopefully convinced, let's walk through the stages your language implementation will need.

The Programming Language Pipeline

At a high level, most programming languages can be broken down into the following pipeline stages:

  1. Lexing (aka tokenization): Breaking the raw source code text into a stream of tokens
  2. Parsing: Converting the linear token stream into a tree structure called an Abstract Syntax Tree (AST)
  3. Semantic Analysis: Traversing the AST to build symbol tables and perform type checking and other validity checks
  4. Code Generation: Converting the AST into lower-level code, such as assembly, bytecode, or another high-level language
  5. Optimization: Analyzing and transforming the low-level code to improve performance

We'll now look at each of these stages in turn. I'll explain the key concepts, share how I implemented them in Pinecone, and discuss some alternatives.

Lexing

The first step for most languages is lexical analysis, or lexing for short. Lexing breaks the source code down into tokens – the smallest meaningful units in the language.

Common token types include:

  • Keywords like if, else, for, while
  • Identifiers (e.g. variable and function names)
  • Literals (numbers, strings, booleans)
  • Punctuation and operators (+, -, *, (, ), etc.)
  • Comments

For example, take this trivial expression in Pinecone:

x = 10 * (y + SOME_CONSTANT)

The lexer would convert this to a sequence of tokens like:

IDENTIFIER(x), EQUAL, NUMBER(10), STAR, LPAREN, IDENTIFIER(y), PLUS, IDENTIFIER(SOME_CONSTANT), RPAREN

Notice how it identifies numbers and names as special token types, while treating punctuation as single-character tokens.

The lexer also typically strips out comments and whitespace between tokens at this stage.

Lexers are usually implemented as a big loop that reads characters one at a time and uses regular expressions or other pattern matching to determine how to group characters into tokens. They are essentially a state machine, switching between states as they recognize different token types.
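
To make this concrete, here is a minimal sketch of such a loop in C++. It only knows the token types from the example above, and the names (TokenKind, Token, lex) are made up for this sketch rather than taken from Pinecone; a real lexer also needs keywords, string literals, comments, and error reporting with line numbers:

#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Number, Equal, Plus, Star, LParen, RParen };

struct Token {
    TokenKind kind;
    std::string text; // the characters this token covers
};

std::vector<Token> lex(const std::string& src) {
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i; // skip whitespace between tokens
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            size_t start = i; // number literal: consume a run of digits
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            tokens.push_back({TokenKind::Number, src.substr(start, i - start)});
        } else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            size_t start = i; // identifier: letters, digits, underscores
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
            tokens.push_back({TokenKind::Identifier, src.substr(start, i - start)});
        } else {
            TokenKind kind; // punctuation: always a single character here
            switch (c) {
                case '=': kind = TokenKind::Equal;  break;
                case '+': kind = TokenKind::Plus;   break;
                case '*': kind = TokenKind::Star;   break;
                case '(': kind = TokenKind::LParen; break;
                case ')': kind = TokenKind::RParen; break;
                default: throw std::runtime_error("unexpected character");
            }
            tokens.push_back({kind, std::string(1, c)});
            ++i;
        }
    }
    return tokens;
}

Running lex over the expression above produces exactly the token sequence we listed.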

In Pinecone, I opted to handroll my own lexer rather than using a tool like Flex. It's only a few hundred lines of code and gives me full control. However, using Flex is a fine option if you want to get up and running quickly and don't need as much customization.

Parsing

In the parsing phase, we convert the flat stream of tokens into a tree structure called an Abstract Syntax Tree or AST. The AST captures the hierarchical structure of the code and the meaning of different language constructs like expressions, statements, and functions.

Parsers are commonly implemented as recursive descent parsers, which use a set of mutually recursive functions to process the sequences of tokens corresponding to different language constructs. The parser steps through the token stream, matching the tokens against grammar rules and constructing the appropriate AST nodes.

For example, consider the earlier Pinecone expression, rewritten here in prefix notation to make its nesting explicit:

(= x (* 10 (+ y SOME_CONSTANT))) 

The parser would recursively construct this AST:

   =
  / \
 x   *
    / \
   10  +
      / \
     y   SOME_CONSTANT

Implementing a parser that can handle a useful subset of a programming language is where things really start to get tricky. In addition to normal expression parsing, you need to handle things like operator precedence, function calls, and flow control statements.
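
To show what that looks like in practice, here is a recursive descent sketch in C++ for just the expression grammar above, repeating the Token types from the lexer sketch. The Node and Parser names are again mine for illustration, not Pinecone's actual parser. Each function implements the grammar rule in the comment above it, and the expression/term split is what makes * bind tighter than +:

#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Token definitions repeated from the lexer sketch.
enum class TokenKind { Identifier, Number, Equal, Plus, Star, LParen, RParen };
struct Token { TokenKind kind; std::string text; };

// One node type keeps things simple: leaves hold a value in `text`,
// interior nodes hold an operator in `text` plus two children.
struct Node {
    std::string text;
    std::unique_ptr<Node> left, right;
};

class Parser {
public:
    explicit Parser(std::vector<Token> toks) : toks_(std::move(toks)) {}

    // assignment := IDENTIFIER '=' expression
    std::unique_ptr<Node> parseAssignment() {
        Token name = expect(TokenKind::Identifier);
        expect(TokenKind::Equal);
        return makeNode("=", makeLeaf(name.text), parseExpression());
    }

private:
    // expression := term ('+' term)*      ('+' binds looser than '*')
    std::unique_ptr<Node> parseExpression() {
        auto left = parseTerm();
        while (match(TokenKind::Plus))
            left = makeNode("+", std::move(left), parseTerm());
        return left;
    }

    // term := factor ('*' factor)*
    std::unique_ptr<Node> parseTerm() {
        auto left = parseFactor();
        while (match(TokenKind::Star))
            left = makeNode("*", std::move(left), parseFactor());
        return left;
    }

    // factor := NUMBER | IDENTIFIER | '(' expression ')'
    std::unique_ptr<Node> parseFactor() {
        if (match(TokenKind::LParen)) {
            auto inner = parseExpression();
            expect(TokenKind::RParen);
            return inner;
        }
        if (peekIs(TokenKind::Number)) return makeLeaf(expect(TokenKind::Number).text);
        return makeLeaf(expect(TokenKind::Identifier).text);
    }

    bool peekIs(TokenKind k) const { return pos_ < toks_.size() && toks_[pos_].kind == k; }
    bool match(TokenKind k) { if (peekIs(k)) { ++pos_; return true; } return false; }
    Token expect(TokenKind k) {
        if (!peekIs(k)) throw std::runtime_error("unexpected token");
        return toks_[pos_++];
    }
    static std::unique_ptr<Node> makeLeaf(std::string value) {
        auto n = std::make_unique<Node>();
        n->text = std::move(value);
        return n;
    }
    static std::unique_ptr<Node> makeNode(std::string op, std::unique_ptr<Node> left,
                                          std::unique_ptr<Node> right) {
        auto n = makeLeaf(std::move(op));
        n->left = std::move(left);
        n->right = std::move(right);
        return n;
    }

    std::vector<Token> toks_;
    size_t pos_ = 0;
};

Feeding this parser the token stream from the lexing section produces the tree shown above.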

Tools like Bison or ANTLR can make parsing easier by allowing you to specify the language grammar in a higher-level form. They will then generate the corresponding parser code for you.

However, I again opted to handroll my Pinecone parser. This gave me full control and allowed me to implement some custom syntax not easily expressible in parser generator tools. It also acted as a forcing function for me to really understand how parsing works. The downside is it took a lot more code and effort to get right.

Semantic Analysis

Once we have the raw AST, we need to traverse it to derive critical semantic information like:

  • Associating identifiers with their declarations to catch undefined variables
  • Type checking to ensure type safety and catch type errors
  • Making sure control flow paths are valid (e.g. no return statement outside a function)
  • Checking for use of uninitialized variables
  • and more

We typically traverse the AST recursively, passing down a symbol table that tracks declared identifiers and their types in the current scope. As the semantic analyzer encounters declarations like variable and function definitions, it adds them to the symbol table. When it sees a variable being used, it looks the name up in the symbol table to verify that it exists and has the right type.
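
The scope-tracking half of that machinery can be quite small. Here is a sketch of a scoped symbol table in C++, with types represented as plain strings to keep things short. It's illustrative; a real analyzer tracks far more per symbol:

#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// One map of name -> type per lexical scope, innermost scope last.
class SymbolTable {
public:
    SymbolTable() { enterScope(); }     // start with a global scope
    void enterScope() { scopes_.emplace_back(); }
    void exitScope()  { scopes_.pop_back(); }

    // Called when the analyzer hits a declaration.
    void declare(const std::string& name, const std::string& type) {
        if (scopes_.back().count(name) != 0)
            throw std::runtime_error("redeclaration of " + name);
        scopes_.back()[name] = type;
    }

    // Called when the analyzer hits a use; searches inner scopes first.
    const std::string& lookup(const std::string& name) const {
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) return found->second;
        }
        throw std::runtime_error("undefined variable " + name);
    }

private:
    std::vector<std::unordered_map<std::string, std::string>> scopes_;
};

The analyzer calls enterScope and exitScope as it walks into and out of functions and blocks, declare at each definition, and lookup at each use; a failed lookup becomes an undefined variable error for the user.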

Semantic analysis is often overlooked by new language creators, but is essential for implementing anything beyond the most basic language constructs. In Pinecone, my semantic analyzer ended up being one of the most complex pieces of the pipeline.

Code Generation

Finally, we reach the stage where we actually generate runnable code from the AST. There are a few different approaches to code generation:

  1. Generate assembly code directly. This gives you the most low-level control but is very complex and platform-dependent. You'll likely need to implement separate backends for each architecture.

  2. Target an existing high-level language like C, Go, or JavaScript. This is much easier than generating assembly, at the cost of less control over performance. You can use the target language's compiler to go the last mile to machine code.

  3. Generate bytecode for a custom virtual machine (VM). This gives a good balance of control and simplicity. You define an instruction set for a stack-based or register-based VM and then generate code that runs on that VM. This is how languages like Java, Python, and Ruby work under the hood.

  4. Directly interpret the AST. In this case you don't generate any explicit lower-level code, but rather treat the AST itself as executable code. At each node, you recursively interpret the child nodes and then perform the appropriate action like adding two numbers or calling a function. This is the simplest approach but sacrifices performance (see the sketch below).
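
Here is roughly what the heart of the fourth approach looks like for the toy AST from the parser sketch, with a single flat map standing in for proper scoped variables. Again, this is illustrative, not Pinecone's actual interpreter:

#include <cctype>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Reuses the Node type from the parser sketch above.
using Env = std::unordered_map<std::string, long>;

long eval(const Node& n, Env& env) {
    if (!n.left && !n.right) { // leaf: a number literal or a variable name
        if (std::isdigit(static_cast<unsigned char>(n.text[0]))) return std::stol(n.text);
        auto it = env.find(n.text);
        if (it == env.end()) throw std::runtime_error("undefined variable " + n.text);
        return it->second;
    }
    if (n.text == "=") { // assignment: evaluate the right side, store, return it
        long value = eval(*n.right, env);
        env[n.left->text] = value;
        return value;
    }
    long l = eval(*n.left, env); // binary operator: evaluate both children
    long r = eval(*n.right, env);
    if (n.text == "+") return l + r;
    if (n.text == "*") return l * r;
    throw std::runtime_error("unknown operator " + n.text);
}

Evaluating the AST for x = 10 * (y + SOME_CONSTANT), with y and SOME_CONSTANT already in the environment, stores the result into x and returns it.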

I took a combined approach for Pinecone. For simplicity, I started with a direct AST interpreter so I could get up and running quickly. This worked great for small examples but didn't have the performance characteristics I wanted for a systems language.

So I added the ability to transpile Pinecone to C++ and then use GCC to compile the generated code to an executable. This was relatively straightforward and gave me a 10-100x speedup over the interpreted approach. Longer-term I plan to implement a proper Pinecone compiler that generates assembly or LLVM IR directly.
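
The core of such a transpiler is also small. Here is a sketch (emitCpp is a name I've made up for illustration) that emits a C++ expression from the toy AST above; parenthesizing every subexpression sidesteps any mismatch between the source language's operator precedence and C++'s. A real transpiler also has to handle statements, declarations, and a runtime, but this is the essence:

#include <string>

// Reuses the Node type from the parser sketch above.
std::string emitCpp(const Node& n) {
    if (!n.left && !n.right) return n.text; // leaf: number or variable name
    if (n.text == "=")
        return emitCpp(*n.left) + " = " + emitCpp(*n.right) + ";";
    // Fully parenthesize so we never depend on C++ precedence rules.
    return "(" + emitCpp(*n.left) + " " + n.text + " " + emitCpp(*n.right) + ")";
}

For the earlier example this emits x = (10 * (y + SOME_CONSTANT));, which GCC compiles like any other C++.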

The key lesson is to not get hung up on immediately building the ideal code generator. The great thing about a multi-stage pipeline is that you can swap out or rewrite stages over time as your needs grow.

Optimization

In a full-fledged compiler, there is typically an optimization stage that performs transformations on the generated low-level code to improve performance. Some common optimizations include:

  • Constant folding
  • Dead code elimination
  • Function inlining
  • Loop unrolling
  • Vectorization

These passes analyze the code to identify opportunities to simplify redundant computations, eliminate unused code paths, and take advantage of low-level processor features.
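
As a taste of what such a pass looks like, here is a constant folding sketch over the toy AST from the parser section: it works bottom-up, replacing any operator node whose children are both number literals with the computed literal, so folds can cascade upward. This is purely illustrative; as noted below, Pinecone doesn't do this yet:

#include <cctype>
#include <string>

// Reuses the Node type from the parser sketch above.
static bool isNumberLeaf(const Node& n) {
    return !n.left && !n.right && std::isdigit(static_cast<unsigned char>(n.text[0]));
}

void fold(Node& n) {
    if (n.left)  fold(*n.left);  // fold children first (bottom-up)
    if (n.right) fold(*n.right);
    if (n.left && n.right && isNumberLeaf(*n.left) && isNumberLeaf(*n.right)) {
        long l = std::stol(n.left->text);
        long r = std::stol(n.right->text);
        if (n.text == "+")      n.text = std::to_string(l + r);
        else if (n.text == "*") n.text = std::to_string(l * r);
        else return;              // not a foldable operator (e.g. "=")
        n.left.reset();           // this node is now a literal leaf
        n.right.reset();
    }
}

With this pass, an expression like x = 2 * (3 + 4) collapses to x = 14 before any code is generated.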

Optimization is not strictly necessary in an early language implementation, but is important for getting the best performance in a production language. So far I haven't focused much on optimization in Pinecone, but it's on the roadmap.

Key Takeaways

Here are a few of the key lessons I've learned so far while building Pinecone:

  1. Start simple and gradually layer on complexity. Begin with a minimal set of core language features and then add more advanced concepts over time as you gain confidence. Write plenty of test programs along the way.

  2. Handrolling a lexer is fine, but consider using a parser generator tool to save time. Even if you want to write an AST interpreter at first, design your AST with a future compiler in mind.

  3. Focus on nailing the frontend before worrying too much about code generation. Making your language syntactically and semantically robust is more important than raw performance in the early days.

  4. Read existing language specs and implementations. The Crafting Interpreters book and Make a Lisp project are fantastic resources for understanding how real languages work under the hood.

  5. Experiment with different language paradigms. Write a Lisp to understand homoiconicity, or a Forth to grok stack-based evaluation. Implement a toy language with a radical feature like dependent types or linear types.

  6. Remember to have fun! Language hacking should be a labor of love. Don't be afraid to deviate from established norms or build silly things. Plenty of hugely influential languages started as side projects or weird experiments.

Conclusion

I hope this post has given you a high-level roadmap of how to create your own programming language. We've looked at the key stages – lexing, parsing, semantic analysis, code generation, and optimization – and how they fit together to bring a language to life.

Building Pinecone has been one of the most rewarding programming experiences of my career. It's given me a much deeper appreciation for the hard work language designers put in and a more intimate understanding of what's really happening when I write code.

I encourage you to give language creation a spin, even if just to write a small interpreter. It's a great way to test your understanding of core programming concepts and grow your skills.

Be sure to check out the resources linked throughout this post, and stay tuned for more updates on Pinecone. Happy language hacking!
