Rizin
unix-like reverse engineering framework and cli tools
|
title: Implementation
Tree-sitter consists of two components: a C library (libtree-sitter
), and a command-line tool (the tree-sitter
CLI).
The library, libtree-sitter
, is used in combination with the parsers generated by the CLI, to produce syntax trees from source code and keep the syntax trees up-to-date as the source code changes. libtree-sitter
is designed to be embedded in applications. It is written in plain C. Its interface is specified in the header file tree_sitter/api.h
.
The CLI is used to generate a parser for a language by supplying a context-free grammar describing the language. The CLI is a build tool; it is no longer needed once a parser has been generated. It is written in Rust, and is available on crates.io, npm, and as a pre-built binary on GitHub.
The tree-sitter
CLI's most important feature is the generate
subcommand. This subcommand reads context-free grammar from a file called grammar.js
and outputs a parser as a C file called parser.c
. The source files in the cli/src
directory all play a role in producing the code in parser.c
. This section will describe some key parts of this process.
First, Tree-sitter must must evaluate the JavaScript code in grammar.js
and convert the grammar to a JSON format. It does this by shelling out to node
. The format of the grammars is formally specified by the JSON schema in grammar-schema.json. The parsing is implemented in parse_grammar.rs.
A Tree-sitter grammar is composed of a set of rules - objects that describe how syntax nodes can be composed from other syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others. Internally, these are all represented using an enum called Rule
.
Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each transformation is implemented by a separate file in the prepare_grammar
directory, and the transformations are ultimately composed together in prepare_grammar/mod.rs
.
At the end of these transformations, the initial grammar is split into two grammars: a syntax grammar and a lexical grammar. The syntax grammar describes how the language's non-terminal symbols are constructed from other grammar symbols, and the lexical grammar describes how the grammar's terminal symbols (strings and regexes) can be composed from individual characters.
WIP