Create your own scripting language in Python with Sly
If you need to create your own language in a Python project, look no further: Sly is here to the rescue! The web is full of lexer/parser generators (ANTLR4, Bison, GOLD, Yacc, Flex, the list goes on and on). There is only one problem: it seems none of them were really designed for Python.
Some time ago, I was assigned the task of designing an in-house platform language for my company's product. The language had to support backward compatibility with a legacy version, run inside a Python code base, deliver high performance for user-facing processes, and remain maintainable by every engineer on my team. I searched the web and read a lot about the different options for meeting these requirements, but most did not seem like a good fit for Python. Luckily, I found Sly.
In this short tutorial I will try to cover the basic terms for creating a scripting language, and give a short example of using Sly to quickly create your own Lexer and Parser.
Basic Terms
Compiler: A program that takes in source code and transforms it into target code. The generated code is usually in a lower-level language and holds the same "meaning" as the given source code. Examples are the many programming languages that require compiling before execution, such as Rust (to machine code) or Java (to JVM bytecode).
Interpreter: A program that executes source code directly, meaning the code is translated only at run time and not before. Examples are languages like Python, which run source code directly and don't require ahead-of-time compilation. (A program that translates source code into another high-level language, rather than executing it, is called a "transpiler".)
Grammar: A set of rules that define a language’s constructs and hierarchy. For example, let’s take a look at this basic language:
statement
: literal
| function
;
literal
: string
| boolean
;
You can already see this simple grammar holds a lot of power for defining a language. A statement can be either a literal or a function. A literal can be a string or a boolean, and we can continue expanding it.
Lexer: A program that turns the input text into a series of "tokens", each token having a type and a value. Every lexer (sometimes referred to as a tokenizer) has a lexicon defining all the words it can identify. For example:
"A string" --> A token of type String.
1011 --> A token of type Integer.
an_object_identifier --> A token of type Identifier.
Parser: After tokenizing the input text, we can execute the rules of our language. The parser applies the rules and builds a syntax tree representing the given text. Each language has certain constructs and hierarchies that define it, which are built on the same lexicon of tokens we got from the lexer. At this point we have an AST (Abstract Syntax Tree) that represents the different constructs in the input text, which we can use to traverse the tokens. Examples of such rules:
Integer Plus Integer --> Define a construct of addition operation.
ObjectLiteral Dot Attribute --> Define a construct of attribute access.
Precedence: When constructing the AST, the parser has to be aware of the order in which we want the different parts of the grammar to be evaluated. Let's assume we are building a calculator. How does the parser know in what order to apply the rules when given:
3 + 2 * 1 # Should be (3 + 2) * 1 or 3 + (2 * 1) ?
To deal with this ambiguity, we can define precedence rules, such as the basic arithmetic rules you learned in grade school.
Shift/Reduce: When such ambiguity pops up in the language, the parser must decide whether to apply the current rule or move ahead and come back to it later. Applying the current rule is called "reducing": evaluating a part of the current expression and replacing it with the newly obtained value. Moving ahead is called "shifting", so that the reduce happens later. For example:
3 + 2 * 2 --> SHIFT  # Multiplication comes first, so keep reading
2 * 2     --> REDUCE # Execute the multiplication
3 + 4     --> REDUCE # Execute the addition
Associativity: When rules with the same precedence are chained together, the parser needs a way to group them. There are basically four options, which I'll demonstrate in the following examples using Python:
1 + 10 + 100 # Returns 111 no matter how we group it (associative)
'a' + 'b' + 'c' # Returns 'abc'; evaluated left to right (left-associative)
2 ** 2 ** 3 # Returns 256; evaluated right to left (right-associative)
a := "b" # The walrus operator is non-associative: chaining it has no meaning and is syntactically incorrect
Let’s dive into Sly
Sly is a pure-Python implementation of a lexer/parser generator. It was created by David Beazley (dabeaz) and can be found on GitHub. Sly is a modernized version of Beazley's older PLY package.
Sly offers great readability, easy syntax, and high maintainability for your project. The best part is the performance it achieves by using an LALR(1) algorithm instead of the LL algorithm used by ANTLR and the like. The project feels made for Python and takes advantage of the language's great features.
To install Sly, run:
pip install sly
Let us begin by creating a Lexer class:
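A minimal sketch of what such a class might look like, assuming a small calculator-style lexicon (the token names and patterns here are illustrative):

from sly import Lexer

class MyLexer(Lexer):
    # The token set is declared up front, using names that are only
    # defined further down in the class body
    tokens = { NUMBER, STRING, IDENTIFIER, PLUS, MINUS, TIMES, DIVIDE, COMPARE }

    ignore = ' \t'  # Characters silently skipped between tokens

    # Simple tokens are plain regex patterns; order matters, so longer
    # patterns (like <=) must come before their substrings (like <)
    STRING     = r'"[^"]*"'
    IDENTIFIER = r'[a-zA-Z_][a-zA-Z0-9_]*'
    PLUS       = r'\+'
    MINUS      = r'-'
    TIMES      = r'\*'
    DIVIDE     = r'/'
    COMPARE    = r'<=|>=|<|>'  # Similar operators grouped in one pattern

    # Tokens that need extra processing use the _ decorator,
    # which is never imported anywhere
    @_(r'\d+')
    def NUMBER(self, t):
        t.value = int(t.value)  # Convert the matched text to an int
        return t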
As you can see, the syntax is a bit confusing at first: the token set is referenced at the beginning of the class, before the tokens themselves are defined, and the _ decorator function is never imported. A great explanation of this metaclass magic can be found in the Sly documentation.
So what did we notice here?
- Pay attention to the order of tokens: patterns are matched in the order they are defined, so longer patterns must come before their substrings.
- Make sure your regex patterns are well built and actually work. A single pattern can also group similar operators (e.g. the SMALLER and BIGGER operators under one COMPARE token), as shown below.
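To see the lexer in action, we can feed it some text (this snippet assumes the MyLexer sketch from above):

lexer = MyLexer()
for tok in lexer.tokenize('count <= 3 + 2'):
    print(tok)  # Each token carries a type, a value, a line number and an index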
Now that we’ve created a basic Lexer, let’s define the Parser. With Sly, our parser also provides us with the entry points into the AST. So any logic that we would wish to implement would go in these methods. A simple Parser class and a rule would look like this:
An expression, by the definition given, is a number plus a number. Notice the usage of the p variable to access the parts of the matched rule. Any expression that does not abide by this rule will fail with an invalid-syntax error. Now we'll look at a bigger example:
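Here is a sketch of a fuller parser, again assuming the token set from our lexer (the rules and precedence levels are illustrative):

from sly import Parser

class MyParser(Parser):
    tokens = MyLexer.tokens

    # Precedence table, from lowest to highest. Each entry holds an
    # associativity and the tokens that share that precedence level
    precedence = (
        ('left', PLUS, MINUS),
        ('left', TIMES, DIVIDE),
    )

    # An expression can be an arithmetic operation...
    @_('expression PLUS expression',
       'expression MINUS expression',
       'expression TIMES expression',
       'expression DIVIDE expression')
    def expression(self, p):
        # Both sub-expressions were already reduced by the time we get here
        if p[1] == '+':
            return p.expression0 + p.expression1
        if p[1] == '-':
            return p.expression0 - p.expression1
        if p[1] == '*':
            return p.expression0 * p.expression1
        return p.expression0 / p.expression1

    # ...or a literal
    @_('literal')
    def expression(self, p):
        return p.literal

    # A literal is itself built from smaller building blocks
    @_('NUMBER')
    def literal(self, p):
        return p.NUMBER

    @_('STRING')
    def literal(self, p):
        return p.STRING.strip('"')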
Notice a few things:
- Each construct can have multiple rules. An expression can be an arithmetic operation or a literal. A literal is itself made up of a few other building blocks.
- The tree is traversed bottom up; take that into account when implementing any logic. Only once the smallest building blocks, the ones with precedence over others, have been reduced can the rules above them be reduced. In the arithmetic operation method the operands are already plain values, because the literal and expression beneath it have already been reduced.
- The precedence definition at the beginning: each entry can contain multiple tokens (token names like PLUS, or literals like "."), and each entry has an associativity definition, where the possible values are "left", "right" and "nonassoc".
At last, we are reaching the end of this tutorial, where we combine all the parts:
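Wiring the sketched lexer and parser together could look like this:

if __name__ == '__main__':
    lexer = MyLexer()
    parser = MyParser()

    result = parser.parse(lexer.tokenize('3 + 2 * 2'))
    print(result)  # 7 (the multiplication is reduced first)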
This is just a simple example, and there is a lot more to cover: error handling, error recovery, conflict resolution and more. I hope you enjoyed this post and found it helpful for creating your own language!