title: Creating Parsers
Developing Tree-sitter grammars can have a difficult learning curve, but once you get the hang of it, it can be fun and even zen-like. This document will help you get started and develop a useful mental model.
In order to develop a Tree-sitter parser, there are two dependencies that you need to install:

* **Node.js** - Tree-sitter requires the `node` command to be in one of the directories in your `PATH`. You'll need Node.js version 6.0 or greater.
* **A C compiler** - To run and test parsers with the `tree-sitter parse` or `tree-sitter test` commands, you must have a C/C++ compiler installed. Tree-sitter will try to look for these compilers in the standard places for each platform.

To create a Tree-sitter parser, you need to use the `tree-sitter` CLI. You can install the CLI in a few different ways:

* Install the `tree-sitter-cli` Node.js module using `npm`, the Node package manager. This is the recommended approach, and it is discussed further in the next section.
* Download a prebuilt binary for your platform and put it into a directory on your `PATH`.
* Build the `tree-sitter-cli` Rust crate from source using `cargo`, the Rust package manager. See the contributing docs for more information.

The preferred convention is to name the parser repository "tree-sitter-" followed by the name of the language.
You can use the `npm` command line tool to create a `package.json` file that describes your project, and allows your parser to be used from Node.js.

Installing `tree-sitter-cli` locally with `npm` puts the CLI into the `node_modules` folder in your working directory. An executable program called `tree-sitter` will be created inside of `node_modules/.bin/`. You may want to follow the Node.js convention of adding that folder to your `PATH` so that you can easily run this program when working in this directory.
Once you have the CLI installed, create a file called `grammar.js` with the following contents:
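A minimal sketch of such a file might look like the following. It assumes the trivial language accepts exactly the string "hello", which is what the test later in this section parses. The name `YOUR_LANGUAGE_NAME` and the rule name `source_file` are placeholders chosen for illustration. The file is meant to be processed by the Tree-sitter CLI, which supplies `grammar` as a global; it is not a standalone Node.js script.

```javascript
// grammar.js
// `grammar` is provided as a global by the Tree-sitter CLI when it
// evaluates this file.
module.exports = grammar({
  // Placeholder: replace with the name of your language.
  name: 'YOUR_LANGUAGE_NAME',

  rules: {
    // The first rule listed is the start rule. This trivial language
    // consists of exactly one string: "hello".
    // TODO: add the actual grammar rules
    source_file: $ => 'hello'
  }
});
```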
Then run `tree-sitter generate`.
This will generate the C code required to parse this trivial language, as well as a few files that are needed to compile and load this native parser as a Node.js module.
You can test this parser by creating a source file with the contents "hello" and parsing it with `tree-sitter parse`. This should print the resulting syntax tree.
You now have a working parser.
Let's go over all of the functionality of the `tree-sitter` command line tool.
generate
The most important command you'll use is `tree-sitter generate`. This command reads the `grammar.js` file in your current working directory and creates a file called `src/parser.c`, which implements the parser. After making changes to your grammar, just run `tree-sitter generate` again.

The first time you run `tree-sitter generate`, it will also generate a few other files:

* `binding.gyp` - This file tells Node.js how to compile your language.
* `bindings/node/index.js` - This is the file that Node.js initially loads when using your language.
* `bindings/node/binding.cc` - This file wraps your language in a JavaScript object when used in Node.js.
* `bindings/rust/lib.rs` - This file wraps your language in a Rust crate when used in Rust.
* `bindings/rust/build.rs` - This file wraps the building process for the Rust crate.
* `src/tree_sitter/parser.h` - This file provides some basic C definitions that are used in your generated `parser.c` file.

If there is an ambiguity or local ambiguity in your grammar, Tree-sitter will detect it during parser generation, and it will exit with an `Unresolved conflict` error message. See below for more information on these errors.
test
The `tree-sitter test` command allows you to easily test that your parser is working correctly.

For each rule that you add to the grammar, you should first create a test that describes how the syntax trees should look when parsing that rule. These tests are written using specially-formatted text files in the `corpus/` or `test/corpus/` directories within your parser's root folder.

For example, you might have a file called `test/corpus/statements.txt` that contains a series of entries. Each entry has the following parts:

* The name of each test is written between two lines containing only `=` (equal sign) characters.
* Then the input source code is written, followed by a line of `-` (dash) characters.
* Then, the expected output syntax tree is written as an S-expression. The exact placement of whitespace in the S-expression doesn't matter, but ideally the syntax tree should be legible. Note that the S-expression does not show syntax nodes like `func`, `(` and `;`, which are expressed as strings and regexes in the grammar. It only shows the named nodes, as described in this section of the page on parser usage.
The expected output section can also optionally show the field names associated with each child node. To include field names in your tests, you write a node's field name followed by a colon, before the node itself in the S-expression:
These tests are important. They serve as the parser's API documentation, and they can be run every time you change the grammar to verify that everything still parses correctly.
By default, the `tree-sitter test` command runs all of the tests in your `corpus` or `test/corpus/` folder. To run a particular test, you can use the `-f` flag.
The recommendation is to be comprehensive in adding tests. If it's a visible node, add it to a test file in your `corpus` directory. It's typically a good idea to test all of the permutations of each language construct. This increases test coverage, and it also gives readers a way to examine expected outputs and understand the "edges" of a language.
You might notice that the first time you run `tree-sitter test` after regenerating your parser, it takes some extra time. This is because Tree-sitter automatically compiles your C code into a dynamically-loadable library. It recompiles your parser as needed whenever you update it by re-running `tree-sitter generate`.
The `tree-sitter test` command will also run any syntax highlighting tests in the `test/highlight` folder, if it exists. For more information about syntax highlighting tests, see the syntax highlighting page.
parse
You can run your parser on an arbitrary file using `tree-sitter parse`. This will print the resulting syntax tree, including nodes' ranges and field names.

You can pass any number of file paths and glob patterns to `tree-sitter parse`, and it will parse all of the given files. The command will exit with a non-zero status code if any parse errors occurred. You can also prevent the syntax trees from being printed using the `--quiet` flag. Additionally, the `--stat` flag prints out aggregated parse success/failure information for all processed files. This makes `tree-sitter parse` usable as a secondary testing strategy: you can check that a large number of files parse without error.
highlight
You can run syntax highlighting on an arbitrary file using `tree-sitter highlight`. This can either output colors directly to your terminal using ANSI escape codes, or produce HTML (if the `--html` flag is passed). For more information, see the syntax highlighting page.
The following is a complete list of built-in functions you can use in your `grammar.js` to define rules. Use-cases for some of these functions will be explained in more detail in later sections.

* **Symbols (the `$` object)** - Every grammar rule is written as a JavaScript function that takes a parameter conventionally called `$`. The syntax `$.identifier` is how you refer to another grammar symbol within a rule.
* **Sequences : `seq(rule1, rule2, ...)`** - This function creates a rule that matches any number of other rules, one after another. It is analogous to simply writing multiple symbols next to each other in EBNF notation.
* **Alternatives : `choice(rule1, rule2, ...)`** - This function creates a rule that matches one of a set of possible rules. The order of the arguments does not matter. This is analogous to the `|` (pipe) operator in EBNF notation.
* **Repetitions : `repeat(rule)`** - This function creates a rule that matches zero-or-more occurrences of a given rule. It is analogous to the `{x}` (curly brace) syntax in EBNF notation.
* **Repetitions : `repeat1(rule)`** - This function creates a rule that matches one-or-more occurrences of a given rule. The previous `repeat` rule is implemented in terms of `repeat1` but is included because it is very commonly used.
* **Options : `optional(rule)`** - This function creates a rule that matches zero or one occurrence of a given rule. It is analogous to the `[x]` (square bracket) syntax in EBNF notation.
* **Precedence : `prec(number, rule)`** - This function marks the given rule with a numerical precedence which will be used to resolve LR(1) conflicts at parser-generation time. When two rules overlap in a way that represents either a true ambiguity or a local ambiguity given one token of lookahead, Tree-sitter will try to resolve the conflict by matching the rule with the higher precedence. The default precedence of all rules is zero. This works similarly to the precedence directives in Yacc grammars.
* **Left Associativity : `prec.left([number], rule)`** - This function marks the given rule as left-associative (and optionally applies a numerical precedence). When an LR(1) conflict arises in which all of the rules have the same numerical precedence, Tree-sitter will consult the rules' associativity. If there is a left-associative rule, Tree-sitter will prefer matching a rule that ends earlier. This works similarly to associativity directives in Yacc grammars.
* **Right Associativity : `prec.right([number], rule)`** - This function is like `prec.left`, but it instructs Tree-sitter to prefer matching a rule that ends later.
* **Dynamic Precedence : `prec.dynamic(number, rule)`** - This function is similar to `prec`, but the given numerical precedence is applied at runtime instead of at parser generation time. This is only necessary when handling a conflict dynamically using the `conflicts` field in the grammar, and when there is a genuine ambiguity: multiple rules correctly match a given piece of code. In that event, Tree-sitter compares the total dynamic precedence associated with each rule, and selects the one with the highest total. This is similar to dynamic precedence directives in Bison grammars.
* **Tokens : `token(rule)`** - This function marks the given rule as producing only a single token. Tree-sitter's default is to treat each String or RegExp literal in the grammar as a separate token. Each token is matched separately by the lexer and returned as its own leaf node in the tree. The `token` function allows you to express a complex rule using the functions described above (rather than as a single regular expression) but still have Tree-sitter treat it as a single token.
* **Immediate Tokens : `token.immediate(rule)`** - Usually, whitespace (and any other extras, such as comments) is optional before each token. This function means that the token will only match if there is no whitespace.
* **Aliases : `alias(rule, name)`** - This function causes the given rule to appear with an alternative name in the syntax tree. If `name` is a symbol, as in `alias($.foo, $.bar)`, then the aliased rule will appear as a named node called `bar`. And if `name` is a string literal, as in `alias($.foo, 'bar')`, then the aliased rule will appear as an anonymous node, as if the rule had been written as the simple string.
* **Field Names : `field(name, rule)`** - This function assigns a field name to the child node(s) matched by the given rule. In the resulting syntax tree, you can then use that field name to access specific children.

In addition to the `name` and `rules` fields, grammars have a few other optional public fields that influence the behavior of the parser.
* **`extras`** - an array of tokens that may appear anywhere in the language. This is often used for whitespace and comments. The default value of `extras` is to accept whitespace. To control whitespace explicitly, specify `extras: $ => []` in your grammar.
* **`inline`** - an array of rule names that should be automatically removed from the grammar by replacing all of their usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you don't want to create syntax tree nodes at runtime.
* **`conflicts`** - an array of arrays of rule names. Each inner array represents a set of rules that are involved in an LR(1) conflict that is intended to exist in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If multiple parses end up succeeding, Tree-sitter will pick the subtree whose corresponding rule has the highest total dynamic precedence.
* **`externals`** - an array of token names which can be returned by an external scanner. External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules (e.g. Python's indentation tokens) that cannot be described by regular expressions.
* **`word`** - the name of a token that will match keywords for the purpose of the keyword extraction optimization.
* **`supertypes`** - an array of hidden rule names which should be considered to be 'supertypes' in the generated node types file.

Writing a grammar requires creativity. There are an infinite number of CFGs (context-free grammars) that can be used to describe any given language. In order to produce a good Tree-sitter parser, you need to create a grammar with two important properties: an intuitive structure and a close adherence to LR(1).
It's unlikely that you'll be able to satisfy these two properties just by translating an existing context-free grammar directly into Tree-sitter's grammar format. There are a few kinds of adjustments that are often required. The following sections will explain these adjustments in more depth.
It's usually a good idea to find a formal specification for the language you're trying to parse. This specification will most likely contain a context-free grammar. As you read through the rules of this CFG, you will probably discover a complex and cyclic graph of relationships. It might be unclear how you should navigate this graph as you define your grammar.
Although languages have very different constructs, their constructs can often be categorized into similar groups like Declarations, Definitions, Statements, Expressions, Types, and Patterns. In writing your grammar, a good first step is to create just enough structure to include all of these basic groups of symbols. For a language like Go, you might start with something like this:
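As an illustration, a breadth-first skeleton along these lines could look like the sketch below. The grammar name `go_sketch` and all rule names are assumptions made for this example. The `TODO` comments mark groups that are touched on but not yet fleshed out. The file is a grammar definition for the Tree-sitter CLI, which provides `grammar`, `seq`, `choice`, and `repeat` as globals.

```javascript
// grammar.js (sketch): one rule per major group, details left as TODOs.
module.exports = grammar({
  name: 'go_sketch', // hypothetical name

  rules: {
    source_file: $ => repeat($._definition),

    _definition: $ => choice(
      $.function_definition
      // TODO: other kinds of definitions
    ),

    function_definition: $ => seq(
      'func',
      $.identifier,
      $.parameter_list,
      $._type,
      $.block
    ),

    parameter_list: $ => seq(
      '(',
      // TODO: parameters
      ')'
    ),

    _type: $ => choice(
      'bool'
      // TODO: other kinds of types
    ),

    block: $ => seq('{', repeat($._statement), '}'),

    _statement: $ => choice(
      $.return_statement
      // TODO: other kinds of statements
    ),

    return_statement: $ => seq('return', $._expression, ';'),

    _expression: $ => choice(
      $.identifier,
      $.number
      // TODO: other kinds of expressions
    ),

    identifier: $ => /[a-z]+/,

    number: $ => /\d+/
  }
});
```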
Some of the details of this grammar will be explained in more depth later on, but if you focus on the `TODO` comments, you can see that the overall strategy is breadth-first. Notably, this initial skeleton does not need to directly match an exact subset of the context-free grammar in the language specification. It just needs to touch on the major groupings of rules in as simple and obvious a way as possible.
With this structure in place, you can now freely decide what part of the grammar to flesh out next. For example, you might decide to start with types. One-by-one, you could define the rules for writing basic types and composing them into more complex types:
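A possible next step for the type sublanguage is sketched below. The rule names `primitive_type`, `array_type`, and `pointer_type`, the grammar name, and the choice of `bool` and `int` are assumptions for illustration. The `source_file` rule is a stand-in so that the fragment forms a complete grammar definition on its own.

```javascript
// grammar.js (sketch): basic types composed into more complex types.
module.exports = grammar({
  name: 'go_types_sketch', // hypothetical name

  rules: {
    // Stand-in start rule so the type rules can be exercised in isolation.
    source_file: $ => repeat($._type),

    // Hidden rule grouping every kind of type.
    _type: $ => choice(
      $.primitive_type,
      $.array_type,
      $.pointer_type
    ),

    // Basic types.
    primitive_type: $ => choice('bool', 'int'),

    // Composite types are built by referring back to `_type`.
    array_type: $ => seq('[', ']', $._type),

    pointer_type: $ => seq('*', $._type)
  }
});
```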
After developing the type sublanguage a bit further, you might decide to switch to working on statements or expressions instead. It's often useful to check your progress by trying to parse some real code using `tree-sitter parse`.

And remember to add tests for each rule in your `corpus` folder!
Imagine that you were just starting work on the Tree-sitter JavaScript parser. Naively, you might try to directly mirror the structure of the ECMAScript Language Spec. To illustrate the problem with this approach, consider the line of code `return x + y;`.

According to the specification, this line is a `ReturnStatement`, the fragment `x + y` is an `AdditiveExpression`, and `x` and `y` are both `IdentifierReferences`. The relationship between these constructs is captured by a complex series of production rules.

The language spec encodes the twenty different precedence levels of JavaScript expressions using twenty levels of indirection between `IdentifierReference` and `Expression`. If we were to create a concrete syntax tree representing this statement according to the language spec, it would have twenty levels of nesting, and it would contain nodes with names like `BitwiseXORExpression`, which are unrelated to the actual code.
To produce a readable syntax tree, we'd like to model JavaScript expressions using a much flatter structure like this:
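A flat structure of this kind might be written as in the sketch below. The grammar name, the `expression_statement` wrapper, and the particular operators are assumptions for illustration. This version is deliberately left without any precedence annotations, so it is expected to be rejected by the generator because of the ambiguity discussed next.

```javascript
// grammar.js (sketch): a flat, intentionally ambiguous expression grammar.
module.exports = grammar({
  name: 'javascript_flat_sketch', // hypothetical name

  rules: {
    source_file: $ => repeat($.expression_statement),

    expression_statement: $ => seq($._expression, ';'),

    // All expression kinds sit at one level instead of twenty.
    _expression: $ => choice(
      $.identifier,
      $.unary_expression,
      $.binary_expression
      // ... other kinds of expressions
    ),

    unary_expression: $ => choice(
      seq('-', $._expression),
      seq('!', $._expression)
    ),

    binary_expression: $ => choice(
      seq($._expression, '*', $._expression),
      seq($._expression, '+', $._expression)
    ),

    identifier: $ => /[a-z]+/
  }
});
```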
Of course, this flat structure is highly ambiguous. If we try to generate a parser, Tree-sitter gives us an error message.

For an expression like `-a * b`, it's not clear whether the `-` operator applies to the `a * b` or just to the `a`. This is where the `prec` function described above comes into play. By wrapping a rule with `prec`, we can indicate that certain sequences of symbols should bind to each other more tightly than others. For example, the `'-', $._expression` sequence in `unary_expression` should bind more tightly than the `$._expression, '+', $._expression` sequence in `binary_expression`:
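One way to express this is sketched below, assuming a flat expression grammar with the rule names used in the text. The precedence value `3`, the grammar name, and the `expression_statement` wrapper are assumptions. Only the unary rule is annotated here, so the associativity conflict between binary operators is still unresolved.

```javascript
// grammar.js (sketch): unary operators bind more tightly than binary ones.
module.exports = grammar({
  name: 'javascript_prec_sketch', // hypothetical name

  rules: {
    source_file: $ => repeat($.expression_statement),

    expression_statement: $ => seq($._expression, ';'),

    _expression: $ => choice(
      $.identifier,
      $.unary_expression,
      $.binary_expression
    ),

    // Any value above the default precedence of zero would do here;
    // 3 is an arbitrary choice.
    unary_expression: $ => prec(3, choice(
      seq('-', $._expression),
      seq('!', $._expression)
    )),

    // Left at the default precedence of zero.
    binary_expression: $ => choice(
      seq($._expression, '*', $._expression),
      seq($._expression, '+', $._expression)
    ),

    identifier: $ => /[a-z]+/
  }
});
```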
Applying a higher precedence in `unary_expression` fixes that conflict, but there is still another conflict.

For an expression like `a * b * c`, it's not clear whether we mean `a * (b * c)` or `(a * b) * c`. This is where `prec.left` and `prec.right` come into use. We want to select the second interpretation, so we use `prec.left`.
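A sketch of how `prec.left` might be applied is shown below. The specific precedence values (`3` for unary operators, `2` for `*`, `1` for `+`), the grammar name, and the `expression_statement` wrapper are assumptions chosen for illustration. What matters is their relative order and the left associativity of the binary rules.

```javascript
// grammar.js (sketch): left-associative binary operators with precedence.
module.exports = grammar({
  name: 'javascript_assoc_sketch', // hypothetical name

  rules: {
    source_file: $ => repeat($.expression_statement),

    expression_statement: $ => seq($._expression, ';'),

    _expression: $ => choice(
      $.identifier,
      $.unary_expression,
      $.binary_expression
    ),

    // Kept above both binary levels so that `-a * b` groups as `(-a) * b`.
    unary_expression: $ => prec(3, choice(
      seq('-', $._expression),
      seq('!', $._expression)
    )),

    binary_expression: $ => choice(
      // `a * b * c` groups as `(a * b) * c`.
      prec.left(2, seq($._expression, '*', $._expression)),
      // `+` binds less tightly than `*`.
      prec.left(1, seq($._expression, '+', $._expression))
    ),

    identifier: $ => /[a-z]+/
  }
});
```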
You may have noticed in the above examples that some of the grammar rule names, like `_expression` and `_type`, began with an underscore. Starting a rule's name with an underscore causes the rule to be hidden in the syntax tree. This is useful for rules like `_expression` in the grammars above, which always just wrap a single child node. If these nodes were not hidden, they would add substantial depth and noise to the syntax tree without making it any easier to understand.
Often, it's easier to analyze a syntax node if you can refer to its children by name instead of by their position in an ordered list. Tree-sitter grammars support this using the `field` function. This function allows you to assign unique names to some or all of a node's children:
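For instance, a function definition rule could name its children as in the sketch below. The field names (`name`, `parameters`, `return_type`, `body`), the grammar name, and the supporting rules are assumptions made so that the fragment forms a complete grammar definition.

```javascript
// grammar.js (sketch): naming the children of a node with `field`.
module.exports = grammar({
  name: 'fields_sketch', // hypothetical name

  rules: {
    source_file: $ => repeat($.function_definition),

    function_definition: $ => seq(
      'func',
      field('name', $.identifier),
      field('parameters', $.parameter_list),
      field('return_type', $._type),
      field('body', $.block)
    ),

    // Minimal supporting rules, kept trivial on purpose.
    parameter_list: $ => seq('(', ')'),

    _type: $ => choice($.primitive_type),

    primitive_type: $ => 'bool',

    block: $ => seq('{', '}'),

    identifier: $ => /[a-z]+/
  }
});
```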
Adding fields like this allows you to retrieve nodes using the field APIs.
Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and lexing - the process of grouping individual characters into the language's fundamental tokens. There are a few important things to know about how Tree-sitter's lexing works.
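Grammars often contain multiple tokens that can match the same characters. For example, a grammar might contain the tokens `"if"` and `/[a-z]+/`. Tree-sitter differentiates between these conflicting tokens in a few ways:

* **Context-aware lexing** - Tree-sitter performs lexing on demand, during the parsing process. At any given position in a source document, the lexer only tries to recognize tokens that are valid at that position.
* **Lexical precedence** - When the precedence functions described above are used within the `token` function, the given precedence values serve as instructions to the lexer. If there are two valid tokens that match the characters at a given position in the document, Tree-sitter will select the one with the higher precedence.
* **Match specificity** - If two valid tokens have the same precedence and match the same number of characters, Tree-sitter will prefer a token that is specified in the grammar as a `String` over a token specified as a `RegExp`.

Many languages have a set of keyword tokens (e.g. `if`, `for`, `return`), as well as a more general token (e.g. `identifier`) that matches any word, including many of the keyword strings. For example, JavaScript has a keyword `instanceof`, which is used as a binary operator, as in `a instanceof Something`.

The following, however, is not valid JavaScript: `a instanceofSomething`.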
A keyword like `instanceof` cannot be followed immediately by another letter, because then it would be tokenized as an `identifier`, even though an identifier is not valid at that position. Because Tree-sitter uses context-aware lexing, as described above, it would not normally impose this restriction. By default, Tree-sitter would recognize `instanceofSomething` as two separate tokens: the `instanceof` keyword followed by an `identifier`.
Fortunately, Tree-sitter has a feature that allows you to fix this, so that you can match the behavior of other standard parsers: the `word` token. If you specify a `word` token in your grammar, Tree-sitter will find the set of keyword tokens that match strings also matched by the `word` token. Then, during lexing, instead of matching each of these keywords individually, Tree-sitter will match the keywords via a two-step process where it first matches the `word` token.
For example, suppose we added `identifier` as the `word` token in our JavaScript grammar:
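A sketch of such a grammar follows. It assumes a cut-down expression grammar containing only the `typeof` and `instanceof` keywords mentioned in the text. The grammar name, precedence values, `expression_statement` wrapper, and identifier pattern are assumptions for illustration.

```javascript
// grammar.js (sketch): keyword extraction via the `word` token.
module.exports = grammar({
  name: 'javascript_word_sketch', // hypothetical name

  // Keywords whose text is also matched by `identifier` are lexed by
  // first matching an identifier, then checking whether it is a keyword.
  word: $ => $.identifier,

  rules: {
    source_file: $ => repeat($.expression_statement),

    expression_statement: $ => seq($._expression, ';'),

    _expression: $ => choice(
      $.identifier,
      $.unary_expression,
      $.binary_expression
    ),

    binary_expression: $ => prec.left(1,
      seq($._expression, 'instanceof', $._expression)
    ),

    unary_expression: $ => prec(2,
      seq('typeof', $._expression)
    ),

    // Upper-case letters are allowed so that a word such as
    // `instanceofSomething` is matched as a single identifier.
    identifier: $ => /[a-zA-Z_]+/
  }
});
```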
Tree-sitter would identify `typeof` and `instanceof` as keywords. Then, when parsing the invalid code above, rather than scanning for the `instanceof` token individually, it would scan for an `identifier` first, and find `instanceofSomething`. It would then correctly recognize the code as invalid.
Aside from improving error detection, keyword extraction also has performance benefits. It allows Tree-sitter to generate a smaller, simpler lexing function, which means that the parser will compile much more quickly.
Many languages have some tokens whose structure is impossible or inconvenient to describe with a regular expression, such as Python's indentation tokens.
Tree-sitter allows you to handle these kinds of tokens using external scanners. An external scanner is a set of C functions that you, the grammar author, can write by hand in order to add custom logic for recognizing certain tokens.
To use an external scanner, there are a few steps. First, add an `externals` section to your grammar. This section should list the names of all of your external tokens. These names can then be used elsewhere in your grammar.
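The sketch below shows what an `externals` section might look like for an indentation-sensitive language. The token names `indent`, `dedent`, and `newline`, the grammar name, and the surrounding rules are assumptions for illustration. A grammar like this also needs a matching scanner source file, as described next, before it can be compiled.

```javascript
// grammar.js (sketch): declaring and using external tokens.
module.exports = grammar({
  name: 'externals_sketch', // hypothetical name

  // Tokens produced by the hand-written external scanner. Their order
  // here must match the order of the enum in the scanner source.
  externals: $ => [
    $.indent,
    $.dedent,
    $.newline
  ],

  rules: {
    source_file: $ => repeat($._statement),

    _statement: $ => choice(
      $.simple_statement,
      $.block
    ),

    // External tokens are referenced like any other symbol.
    simple_statement: $ => seq($.identifier, $.newline),

    block: $ => seq(
      $.identifier,
      ':',
      $.newline,
      $.indent,
      repeat1($._statement),
      $.dedent
    ),

    identifier: $ => /[a-z]+/
  }
});
```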
Then, add another C or C++ source file to your project. Currently, its path must be `src/scanner.c` or `src/scanner.cc` for the CLI to recognize it. Be sure to add this file to the `sources` section of your `binding.gyp` file so that it will be included when your project is compiled by Node.js.
In this new source file, define an `enum` type containing the names of all of your external tokens. The ordering of this enum must match the order in your grammar's `externals` array.

Finally, you must define five functions with specific names, based on your language's name and five actions: create, destroy, serialize, deserialize, and scan. These functions must all use C linkage, so if you're writing the scanner in C++, you need to declare them with the `extern "C"` qualifier.
The create function should create your scanner object. It will only be called once anytime your language is set on a parser. Often, you will want to allocate memory on the heap and return a pointer to it. If your external scanner doesn't need to maintain any state, it's ok to return `NULL`.
The destroy function should free any memory used by your scanner. It is called once when a parser is deleted or assigned a different language. It receives as an argument the same pointer that was returned from the create function. If your create function didn't allocate any memory, this function can be a no-op.
The serialize function should copy the complete state of your scanner into a given byte buffer, and return the number of bytes written. The function is called every time the external scanner successfully recognizes a token. It receives a pointer to your scanner and a pointer to a buffer. The maximum number of bytes that you can write is given by the `TREE_SITTER_SERIALIZATION_BUFFER_SIZE` constant, defined in the `tree_sitter/parser.h` header file.

The data that this function writes will ultimately be stored in the syntax tree so that the scanner can be restored to the right state when handling edits or ambiguities. For your parser to work correctly, the `serialize` function must store its entire state, and `deserialize` must restore the entire state. For good performance, you should design your scanner so that its state can be serialized as quickly and compactly as possible.
The deserialize function should restore the state of your scanner based on the bytes that were previously written by the `serialize` function. It is called with a pointer to your scanner, a pointer to the buffer of bytes, and the number of bytes that should be read.
The scan function is responsible for recognizing external tokens. It should return `true` if a token was recognized, and `false` otherwise. It is called with a "lexer" struct with the following fields:

* **`int32_t lookahead`** - The next character in the input stream, represented as a 32-bit Unicode code point.
* **`TSSymbol result_symbol`** - The symbol that was recognized. Your scan function should assign to this field one of the values from the `TokenType` enum, described above.
* **`void (*advance)(TSLexer *, bool skip)`** - A function for advancing to the next character. If you pass `true` for the second argument, the current character will be treated as whitespace.
* **`void (*mark_end)(TSLexer *)`** - A function for marking the end of the recognized token. This allows matching tokens that require multiple characters of lookahead. By default (if you don't call `mark_end`), any character that you moved past using the `advance` function will be included in the size of the token. But once you call `mark_end`, then any later calls to `advance` will not increase the size of the returned token. You can call `mark_end` multiple times to increase the size of the token.
* **`uint32_t (*get_column)(TSLexer *)`** - A function for querying the current column position of the lexer. It returns the number of codepoints since the start of the current line. The codepoint position is recalculated on every call to this function by reading from the start of the line.
* **`bool (*is_at_included_range_start)(TSLexer *)`** - A function for checking if the parser has just skipped some characters in the document. When parsing an embedded document using the `ts_parser_set_included_ranges` function (described in the multi-language document section), your scanner may want to apply some special behavior when moving to a disjoint part of the document. For example, in EJS documents, the JavaScript parser uses this function to enable inserting automatic semicolon tokens in between the code directives, delimited by `<%` and `%>`.

The third argument to the `scan` function is an array of booleans that indicates which of your external tokens are currently expected by the parser. You should only look for a given token if it is valid according to this array. At the same time, you cannot backtrack, so you may need to combine certain pieces of logic.