Rizin
unix-like reverse engineering framework and cli tools

Using Parsers

All of Tree-sitter's parsing functionality is exposed through C APIs. Applications written in higher-level languages can use Tree-sitter via binding libraries like node-tree-sitter or the tree-sitter Rust crate, which have their own documentation.

This document will describe the general concepts of how to use Tree-sitter, which should be relevant regardless of what language you're using. It also goes into some C-specific details that are useful if you're using the C API directly or are building a new binding to a different language.

All of the API functions shown here are declared and documented in the tree_sitter/api.h header file. You may also want to browse the online Rust API docs, which correspond to the C APIs closely.

Getting Started

Building the Library

To build the library on a POSIX system, just run make in the Tree-sitter directory. This will create a static library called libtree-sitter.a as well as dynamic libraries.

Alternatively, you can incorporate the library in a larger project's build system by adding one source file to the build. This source file needs two directories to be in the include path when compiled:

source file:

  • tree-sitter/lib/src/lib.c

include directories:

  • tree-sitter/lib/src
  • tree-sitter/lib/include

The Basic Objects

There are four main types of objects involved when using Tree-sitter: languages, parsers, syntax trees, and syntax nodes. In C, these are called TSLanguage, TSParser, TSTree, and TSNode.

  • A TSLanguage is an opaque object that defines how to parse a particular programming language. The code for each TSLanguage is generated by Tree-sitter. Many languages are already available in separate git repositories within the Tree-sitter GitHub organization. See the next page for how to create new languages.
  • A TSParser is a stateful object that can be assigned a TSLanguage and used to produce a TSTree based on some source code.
  • A TSTree represents the syntax tree of an entire source code file. It contains TSNode instances that indicate the structure of the source code. It can also be edited and used to produce a new TSTree in the event that the source code changes.
  • A TSNode represents a single node in the syntax tree. It tracks its start and end positions in the source code, as well as its relation to other nodes like its parent, siblings and children.

An Example Program

Here's an example of a simple C program that uses the Tree-sitter JSON parser.

// Filename - test-json-parser.c

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tree_sitter/api.h>

// Declare the `tree_sitter_json` function, which is
// implemented by the `tree-sitter-json` library.
const TSLanguage *tree_sitter_json(void);

int main(void) {
  // Create a parser.
  TSParser *parser = ts_parser_new();

  // Set the parser's language (JSON in this case).
  ts_parser_set_language(parser, tree_sitter_json());

  // Build a syntax tree based on source code stored in a string.
  const char *source_code = "[1, null]";
  TSTree *tree = ts_parser_parse_string(
    parser,
    NULL,
    source_code,
    strlen(source_code)
  );

  // Get the root node of the syntax tree.
  TSNode root_node = ts_tree_root_node(tree);

  // Get some child nodes.
  TSNode array_node = ts_node_named_child(root_node, 0);
  TSNode number_node = ts_node_named_child(array_node, 0);

  // Check that the nodes have the expected types.
  assert(strcmp(ts_node_type(root_node), "document") == 0);
  assert(strcmp(ts_node_type(array_node), "array") == 0);
  assert(strcmp(ts_node_type(number_node), "number") == 0);

  // Check that the nodes have the expected child counts.
  assert(ts_node_child_count(root_node) == 1);
  assert(ts_node_child_count(array_node) == 5);
  assert(ts_node_named_child_count(array_node) == 2);
  assert(ts_node_child_count(number_node) == 0);

  // Print the syntax tree as an S-expression.
  char *string = ts_node_string(root_node);
  printf("Syntax tree: %s\n", string);

  // Free all of the heap-allocated memory.
  free(string);
  ts_tree_delete(tree);
  ts_parser_delete(parser);
  return 0;
}

This program uses the Tree-sitter C API, which is declared in the header file tree_sitter/api.h, so we need to add the tree-sitter/lib/include directory to the include path. We also need to link libtree-sitter.a into the binary. We compile the source code of the JSON language directly into the binary as well.

clang \
  -I tree-sitter/lib/include \
  test-json-parser.c \
  tree-sitter-json/src/parser.c \
  tree-sitter/libtree-sitter.a \
  -o test-json-parser

./test-json-parser

Basic Parsing

Providing the Code

In the example above, we parsed source code stored in a simple string using the ts_parser_parse_string function:

TSTree *ts_parser_parse_string(
  TSParser *self,
  const TSTree *old_tree,
  const char *string,
  uint32_t length
);

You may want to parse source code that's stored in a custom data structure, like a piece table or a rope. In this case, you can use the more general ts_parser_parse function:

TSTree *ts_parser_parse(
  TSParser *self,
  const TSTree *old_tree,
  TSInput input
);

The TSInput structure lets you provide your own function for reading a chunk of text at a given byte offset and row/column position. The function can return text encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse text that is stored in your own data structure.

typedef struct {
  void *payload;
  const char *(*read)(
    void *payload,
    uint32_t byte_offset,
    TSPoint position,
    uint32_t *bytes_read
  );
  TSInputEncoding encoding;
} TSInput;
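
As a rough illustration of the read callback, here is a minimal sketch for text stored as a list of chunks. The ChunkedText type and the chunked_read function are hypothetical, and Point is a local stand-in for TSPoint so that the sketch compiles without the library. With the real API, a function of this shape would be assigned to the read member of a TSInput, with the ChunkedText pointer as the payload. Writing zero to bytes_read signals the end of the document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Local stand-in for TSPoint (assumption: used only so this compiles alone).
typedef struct {
  uint32_t row;
  uint32_t column;
} Point;

// Hypothetical document representation: text split into NUL-terminated chunks.
typedef struct {
  const char **chunks;
  uint32_t chunk_count;
} ChunkedText;

// Same shape as the `read` member of TSInput: return a pointer to the text
// that starts at `byte_offset` and store the number of readable bytes in
// `*bytes_read`. Zero bytes means the end of the document was reached.
static const char *chunked_read(
  void *payload,
  uint32_t byte_offset,
  Point position,
  uint32_t *bytes_read
) {
  (void)position;  // this sketch locates text by byte offset only
  assert(payload != NULL && bytes_read != NULL);
  const ChunkedText *text = payload;

  uint32_t chunk_start = 0;
  for (uint32_t i = 0; i < text->chunk_count; i++) {
    uint32_t chunk_len = (uint32_t)strlen(text->chunks[i]);
    if (byte_offset < chunk_start + chunk_len) {
      uint32_t offset_in_chunk = byte_offset - chunk_start;
      *bytes_read = chunk_len - offset_in_chunk;
      return text->chunks[i] + offset_in_chunk;
    }
    chunk_start += chunk_len;
  }

  *bytes_read = 0;
  return "";
}
```

The callback never copies text; it hands back a pointer into the chunk that contains the requested offset, which is what makes this interface cheap for ropes and piece tables.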

Syntax Nodes

Tree-sitter provides a DOM-style interface for inspecting syntax trees. A syntax node's type is a string that indicates which grammar rule the node represents.

const char *ts_node_type(TSNode);

Syntax nodes store their position in the source code both in terms of raw bytes and row/column coordinates:

typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;

uint32_t ts_node_start_byte(TSNode);
uint32_t ts_node_end_byte(TSNode);
TSPoint ts_node_start_point(TSNode);
TSPoint ts_node_end_point(TSNode);

Retrieving Nodes

Every tree has a root node:

TSNode ts_tree_root_node(const TSTree *);

Once you have a node, you can access the node's children:

uint32_t ts_node_child_count(TSNode);
TSNode ts_node_child(TSNode, uint32_t);

You can also access its siblings and parent:

TSNode ts_node_next_sibling(TSNode);
TSNode ts_node_prev_sibling(TSNode);
TSNode ts_node_parent(TSNode);

These methods may all return a null node to indicate, for example, that a node does not have a next sibling. You can check if a node is null:

bool ts_node_is_null(TSNode);

Named vs Anonymous Nodes

Tree-sitter produces concrete syntax trees - trees that contain nodes for every individual token in the source code, including things like commas and parentheses. This is important for use-cases that deal with individual tokens, like syntax highlighting. But some types of code analysis are easier to perform using an abstract syntax tree - a tree in which the less important details have been removed. Tree-sitter's trees support these use cases by making a distinction between named and anonymous nodes.

Consider a grammar rule like this:

if_statement: ($) => seq("if", "(", $._expression, ")", $._statement);

A syntax node representing an if_statement in this language would have 5 children: the condition expression, the body statement, as well as the if, (, and ) tokens. The expression and the statement would be marked as named nodes, because they have been given explicit names in the grammar. But the if, (, and ) nodes would not be named nodes, because they are represented in the grammar as simple strings.

You can check whether any given node is named:

bool ts_node_is_named(TSNode);

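When traversing the tree, you can also choose to skip over anonymous nodes by using the named variants of all of the methods described above:

TSNode ts_node_named_child(TSNode, uint32_t);
uint32_t ts_node_named_child_count(TSNode);
TSNode ts_node_next_named_sibling(TSNode);
TSNode ts_node_prev_named_sibling(TSNode);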

If you use this group of methods, the syntax tree functions much like an abstract syntax tree.
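
To make the distinction concrete, here is a toy model of the if_statement example above. It does not use the library: ToyChild, IF_CHILDREN and the two helper functions are hypothetical names that only mimic how the named variants skip anonymous children. The "expression" and "statement" type strings are placeholders for whichever concrete rules matched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for one child of a syntax node.
typedef struct {
  const char *type;
  bool is_named;
} ToyChild;

// The five children of the if_statement rule: three anonymous tokens
// and two named nodes.
static const ToyChild IF_CHILDREN[] = {
  {"if", false},
  {"(", false},
  {"expression", true},
  {")", false},
  {"statement", true},
};

// Counts only the named children, in the spirit of ts_node_named_child_count.
static uint32_t toy_named_child_count(const ToyChild *children, uint32_t count) {
  assert(children != NULL || count == 0);
  uint32_t named = 0;
  for (uint32_t i = 0; i < count; i++) {
    if (children[i].is_named) named++;
  }
  return named;
}

// Returns the index-th named child, skipping anonymous ones, or NULL when
// there is no such child (the real API returns a null node instead).
static const ToyChild *toy_named_child(
  const ToyChild *children,
  uint32_t count,
  uint32_t index
) {
  for (uint32_t i = 0; i < count; i++) {
    if (!children[i].is_named) continue;
    if (index == 0) return &children[i];
    index--;
  }
  return NULL;
}
```

In this model the node has five children but only two named children, so named index 1 refers to the body statement rather than the "(" token.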

Node Field Names

To make syntax nodes easier to analyze, many grammars assign unique field names to particular child nodes. The next page explains how to do this in your own grammars. If a syntax node has fields, you can access its children using their field name:

TSNode ts_node_child_by_field_name(
  TSNode self,
  const char *field_name,
  uint32_t field_name_length
);

Fields also have numeric ids that you can use, if you want to avoid repeated string comparisons. You can convert between strings and ids using the TSLanguage:

uint32_t ts_language_field_count(const TSLanguage *);
const char *ts_language_field_name_for_id(const TSLanguage *, TSFieldId);
TSFieldId ts_language_field_id_for_name(const TSLanguage *, const char *, uint32_t);

The field ids can be used in place of the name:

TSNode ts_node_child_by_field_id(TSNode, TSFieldId);

Advanced Parsing

Editing

In applications like text editors, you often need to re-parse a file after its source code has changed. Tree-sitter is designed to support this use case efficiently. There are two steps required. First, you must edit the syntax tree, which adjusts the ranges of its nodes so that they stay in sync with the code.

typedef struct {
  uint32_t start_byte;
  uint32_t old_end_byte;
  uint32_t new_end_byte;
  TSPoint start_point;
  TSPoint old_end_point;
  TSPoint new_end_point;
} TSInputEdit;

void ts_tree_edit(TSTree *, const TSInputEdit *);
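
Filling in the six fields by hand is error-prone, so editors usually derive them from the change itself. The following is a minimal sketch of that bookkeeping for a "replace a byte range with new text" edit. Point and InputEdit are local stand-ins that mirror the fields of TSPoint and TSInputEdit, and point_at, advance_point and describe_replacement are hypothetical helpers. It assumes columns are counted in bytes and that lines end with "\n".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Local stand-ins mirroring TSPoint and TSInputEdit (assumption: used only
// so this compiles without tree_sitter/api.h).
typedef struct {
  uint32_t row;
  uint32_t column;
} Point;

typedef struct {
  uint32_t start_byte;
  uint32_t old_end_byte;
  uint32_t new_end_byte;
  Point start_point;
  Point old_end_point;
  Point new_end_point;
} InputEdit;

// Moves `point` forward over the first `length` bytes of `text`.
static Point advance_point(Point point, const char *text, uint32_t length) {
  for (uint32_t i = 0; i < length; i++) {
    if (text[i] == '\n') {
      point.row++;
      point.column = 0;
    } else {
      point.column++;
    }
  }
  return point;
}

// Row/column position of `byte_offset` within `text`.
static Point point_at(const char *text, uint32_t byte_offset) {
  Point origin = {0, 0};
  return advance_point(origin, text, byte_offset);
}

// Describes replacing old_text[start_byte, old_end_byte) with `replacement`.
static InputEdit describe_replacement(
  const char *old_text,
  uint32_t start_byte,
  uint32_t old_end_byte,
  const char *replacement
) {
  assert(start_byte <= old_end_byte && old_end_byte <= strlen(old_text));
  uint32_t new_length = (uint32_t)strlen(replacement);

  InputEdit edit;
  edit.start_byte = start_byte;
  edit.old_end_byte = old_end_byte;
  edit.new_end_byte = start_byte + new_length;
  edit.start_point = point_at(old_text, start_byte);
  edit.old_end_point = point_at(old_text, old_end_byte);
  edit.new_end_point = advance_point(edit.start_point, replacement, new_length);
  return edit;
}
```

With the real types, the resulting values would be copied into a TSInputEdit and passed to ts_tree_edit before re-parsing.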

Then, you can call ts_parser_parse again, passing in the old tree. This will create a new tree that internally shares structure with the old tree.

When you edit a syntax tree, the positions of its nodes will change. If you have stored any TSNode instances outside of the TSTree, you must update their cached positions separately, using the same TSInputEdit value:

void ts_node_edit(TSNode *, const TSInputEdit *);

This ts_node_edit function is only needed in the case where you have retrieved TSNode instances before editing the tree, and then after editing the tree, you want to continue to use those specific node instances. Often, you'll just want to re-fetch nodes from the edited tree, in which case ts_node_edit is not needed.

Multi-language Documents

Sometimes, different parts of a file may be written in different languages. For example, templating languages like EJS and ERB allow you to generate HTML by writing a mixture of HTML and another language like JavaScript or Ruby.

Tree-sitter handles these types of documents by allowing you to create a syntax tree based on the text in certain ranges of a file.

typedef struct {
  TSPoint start_point;
  TSPoint end_point;
  uint32_t start_byte;
  uint32_t end_byte;
} TSRange;

bool ts_parser_set_included_ranges(
  TSParser *self,
  const TSRange *ranges,
  uint32_t range_count
);

For example, consider this ERB document:

<ul>
<% people.each do |person| %>
<li><%= person.name %></li>
<% end %>
</ul>

Conceptually, it can be represented by three syntax trees with overlapping ranges: an ERB syntax tree, a Ruby syntax tree, and an HTML syntax tree. You could generate these syntax trees with the following code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tree_sitter/api.h>

// These functions are each implemented in their own repo.
const TSLanguage *tree_sitter_embedded_template();
const TSLanguage *tree_sitter_html();
const TSLanguage *tree_sitter_ruby();

// Capacity of the fixed-size range arrays below.
#define MAX_RANGES 10

int main(int argc, const char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <erb-text>\n", argv[0]);
    return 1;
  }
  const char *text = argv[1];
  unsigned len = strlen(text);

  // Parse the entire text as ERB.
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_embedded_template());
  TSTree *erb_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode erb_root_node = ts_tree_root_node(erb_tree);

  // In the ERB syntax tree, find the ranges of the `content` nodes,
  // which represent the underlying HTML, and the `code` nodes, which
  // represent the interpolated Ruby.
  TSRange html_ranges[MAX_RANGES];
  TSRange ruby_ranges[MAX_RANGES];
  unsigned html_range_count = 0;
  unsigned ruby_range_count = 0;
  unsigned child_count = ts_node_child_count(erb_root_node);

  for (unsigned i = 0; i < child_count; i++) {
    TSNode node = ts_node_child(erb_root_node, i);
    if (strcmp(ts_node_type(node), "content") == 0) {
      // Skip ranges that would overflow the fixed-size array.
      if (html_range_count == MAX_RANGES) continue;
      html_ranges[html_range_count++] = (TSRange) {
        ts_node_start_point(node),
        ts_node_end_point(node),
        ts_node_start_byte(node),
        ts_node_end_byte(node),
      };
    } else {
      if (ruby_range_count == MAX_RANGES) continue;
      TSNode code_node = ts_node_named_child(node, 0);
      ruby_ranges[ruby_range_count++] = (TSRange) {
        ts_node_start_point(code_node),
        ts_node_end_point(code_node),
        ts_node_start_byte(code_node),
        ts_node_end_byte(code_node),
      };
    }
  }

  // Use the HTML ranges to parse the HTML.
  ts_parser_set_language(parser, tree_sitter_html());
  ts_parser_set_included_ranges(parser, html_ranges, html_range_count);
  TSTree *html_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode html_root_node = ts_tree_root_node(html_tree);

  // Use the Ruby ranges to parse the Ruby.
  ts_parser_set_language(parser, tree_sitter_ruby());
  ts_parser_set_included_ranges(parser, ruby_ranges, ruby_range_count);
  TSTree *ruby_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode ruby_root_node = ts_tree_root_node(ruby_tree);

  // Print all three trees.
  char *erb_sexp = ts_node_string(erb_root_node);
  char *html_sexp = ts_node_string(html_root_node);
  char *ruby_sexp = ts_node_string(ruby_root_node);
  printf("ERB: %s\n", erb_sexp);
  printf("HTML: %s\n", html_sexp);
  printf("Ruby: %s\n", ruby_sexp);

  // Free all of the heap-allocated memory.
  free(erb_sexp);
  free(html_sexp);
  free(ruby_sexp);
  ts_tree_delete(erb_tree);
  ts_tree_delete(html_tree);
  ts_tree_delete(ruby_tree);
  ts_parser_delete(parser);
  return 0;
}

This API allows for great flexibility in how languages can be composed. Tree-sitter is not responsible for mediating the interactions between languages. Instead, you are free to do that using arbitrary application-specific logic.

Concurrency

Tree-sitter supports multi-threaded use cases by making syntax trees very cheap to copy.

TSTree *ts_tree_copy(const TSTree *);

Internally, copying a syntax tree just entails incrementing an atomic reference count. Conceptually, it provides you a new tree which you can freely query, edit, reparse, or delete on a new thread while continuing to use the original tree on a different thread. Note that individual TSTree instances are not thread safe; you must copy a tree if you want to use it on multiple threads simultaneously.
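
The general technique behind such cheap copies can be sketched with C11 atomics. This is not Tree-sitter's actual implementation; SharedData, Handle and the handle_* functions are hypothetical names that only illustrate how a copy can share data by bumping an atomic reference count, with the data freed when the last handle is deleted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

// Hypothetical shared, immutable data guarded by an atomic reference count.
typedef struct {
  atomic_uint ref_count;
  int payload;  // stands in for the shared tree contents
} SharedData;

// Each thread owns its own handle; handles may point at the same data.
typedef struct {
  SharedData *data;
} Handle;

static Handle handle_new(int payload) {
  SharedData *data = malloc(sizeof *data);
  assert(data != NULL);
  atomic_init(&data->ref_count, 1);
  data->payload = payload;
  return (Handle){data};
}

// Copying does not duplicate the data; it only increments the count.
static Handle handle_copy(Handle handle) {
  atomic_fetch_add(&handle.data->ref_count, 1);
  return (Handle){handle.data};
}

// The data is freed by whichever handle drops the count to zero.
static void handle_delete(Handle handle) {
  if (atomic_fetch_sub(&handle.data->ref_count, 1) == 1) {
    free(handle.data);
  }
}

static unsigned handle_ref_count(Handle handle) {
  return atomic_load(&handle.data->ref_count);
}
```

Because the increment and decrement are atomic, handles to the same data can be created and deleted on different threads without a lock, while each individual handle is still used by one thread at a time.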

Other Tree Operations

Walking Trees with Tree Cursors

You can access every node in a syntax tree using the TSNode APIs described above, but if you need to access a large number of nodes, the fastest way to do so is with a tree cursor. A cursor is a stateful object that allows you to walk a syntax tree with maximum efficiency.

You can initialize a cursor from any node:

TSTreeCursor ts_tree_cursor_new(TSNode);

You can move the cursor around the tree:

bool ts_tree_cursor_goto_first_child(TSTreeCursor *);
bool ts_tree_cursor_goto_next_sibling(TSTreeCursor *);
bool ts_tree_cursor_goto_parent(TSTreeCursor *);

These methods return true if the cursor successfully moved and false if there was no node to move to.

You can always retrieve the cursor's current node, as well as the field name that is associated with the current node.

TSNode ts_tree_cursor_current_node(const TSTreeCursor *);
const char *ts_tree_cursor_current_field_name(const TSTreeCursor *);
TSFieldId ts_tree_cursor_current_field_id(const TSTreeCursor *);
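
The three movement functions are enough to visit a whole subtree in pre-order without recursion. The sketch below shows the loop shape on a toy tree so that it compiles without the library. ToyNode, ToyCursor and the toy_* functions are hypothetical stand-ins for TSNode, TSTreeCursor and the ts_tree_cursor_goto_* functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical toy tree node with the links a cursor needs.
typedef struct ToyNode {
  const char *type;
  struct ToyNode *parent;
  struct ToyNode *first_child;
  struct ToyNode *next_sibling;
} ToyNode;

// Hypothetical cursor: just the current node.
typedef struct {
  ToyNode *node;
} ToyCursor;

// Each move returns true on success and false if there is nowhere to go.
static bool toy_goto_first_child(ToyCursor *cursor) {
  if (cursor->node->first_child == NULL) return false;
  cursor->node = cursor->node->first_child;
  return true;
}

static bool toy_goto_next_sibling(ToyCursor *cursor) {
  if (cursor->node->next_sibling == NULL) return false;
  cursor->node = cursor->node->next_sibling;
  return true;
}

static bool toy_goto_parent(ToyCursor *cursor) {
  if (cursor->node->parent == NULL) return false;
  cursor->node = cursor->node->parent;
  return true;
}

// Visits `root` and its descendants in pre-order, storing up to `capacity`
// node types in `out_types`. Returns the number of nodes visited.
static uint32_t toy_walk_preorder(ToyNode *root, const char **out_types, uint32_t capacity) {
  assert(root != NULL);
  ToyCursor cursor = {root};
  uint32_t visited = 0;
  for (;;) {
    if (visited < capacity) out_types[visited] = cursor.node->type;
    visited++;

    // Descend first; otherwise move sideways, climbing when a level is done.
    if (toy_goto_first_child(&cursor)) continue;
    for (;;) {
      if (cursor.node == root) return visited;
      if (toy_goto_next_sibling(&cursor)) break;
      toy_goto_parent(&cursor);
    }
  }
}
```

The walk stops as soon as it climbs back to the node it started from, so a traversal that begins at an inner node never wanders into that node's siblings.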

Pattern Matching with Queries

Many code analysis tasks involve searching for patterns in syntax trees. Tree-sitter provides a small declarative language for expressing these patterns and searching for matches. The language is similar to the format of Tree-sitter's unit test system.

Query Syntax

A query consists of one or more patterns, where each pattern is an S-expression that matches a certain set of nodes in a syntax tree. The expression to match a given node consists of a pair of parentheses containing two things: the node's type, and optionally, a series of other S-expressions that match the node's children. For example, this pattern would match any binary_expression node whose children are both number_literal nodes:

(binary_expression (number_literal) (number_literal))

Children can also be omitted. For example, this would match any binary_expression where at least one child is a string_literal node:

(binary_expression (string_literal))

Fields

In general, it's a good idea to make patterns more specific by specifying field names associated with child nodes. You do this by prefixing a child pattern with a field name followed by a colon. For example, this pattern would match an assignment_expression node where the left child is a member_expression whose object is a call_expression.

(assignment_expression
  left: (member_expression
    object: (call_expression)))

Negated Fields

You can also constrain a pattern so that it only matches nodes that lack a certain field. To do this, add a field name prefixed by a ! within the parent pattern. For example, this pattern would match a class declaration with no type parameters:

(class_declaration
  name: (identifier) @class_name
  !type_parameters)

Anonymous Nodes

The parenthesized syntax for writing nodes only applies to named nodes. To match specific anonymous nodes, you write their name between double quotes. For example, this pattern would match any binary_expression where the operator is != and the right side is null:

(binary_expression
  operator: "!="
  right: (null))

Capturing Nodes

When matching patterns, you may want to process specific nodes within the pattern. Captures allow you to associate names with specific nodes in a pattern, so that you can later refer to those nodes by those names. Capture names are written after the nodes that they refer to, and start with an @ character.

For example, this pattern would match any assignment of a function to an identifier, and it would associate the name the-function-name with the identifier:

(assignment_expression
  left: (identifier) @the-function-name
  right: (function))

And this pattern would match all method definitions, associating the name the-method-name with the method name and the-class-name with the containing class name:

(class_declaration
  name: (identifier) @the-class-name
  body: (class_body
    (method_definition
      name: (property_identifier) @the-method-name)))

Quantification Operators

You can match a repeating sequence of sibling nodes using the postfix + and * repetition operators, which work analogously to the + and * operators in regular expressions. The + operator matches one or more repetitions of a pattern, and the * operator matches zero or more.

For example, this pattern would match a sequence of one or more comments:

(comment)+

This pattern would match a class declaration, capturing all of the decorators if any were present:

(class_declaration
  (decorator)* @the-decorator
  name: (identifier) @the-name)

You can also mark a node as optional using the ? operator. For example, this pattern would match all function calls, capturing a string argument if one was present:

(call_expression
  function: (identifier) @the-function
  arguments: (arguments (string)? @the-string-arg))

Grouping Sibling Nodes

You can also use parentheses for grouping a sequence of sibling nodes. For example, this pattern would match a comment followed by a function declaration:

(
  (comment)
  (function_declaration)
)

Any of the quantification operators mentioned above (+, *, and ?) can also be applied to groups. For example, this pattern would match a comma-separated series of numbers:

(
  (number)
  ("," (number))*
)

Alternations

An alternation is written as a pair of square brackets ([]) containing a list of alternative patterns. This is similar to character classes from regular expressions ([abc] matches either a, b, or c).

For example, this pattern would match a call to either a variable or an object property. In the case of a variable, capture it as @function, and in the case of a property, capture it as @method:

(call_expression
  function: [
    (identifier) @function
    (member_expression
      property: (property_identifier) @method)
  ])

This pattern would match a set of possible keyword tokens, capturing them as @keyword:

[
  "break"
  "delete"
  "else"
  "for"
  "function"
  "if"
  "return"
  "try"
  "while"
] @keyword

Wildcard Node

A wildcard node is represented with an underscore (_) and matches any node. This is similar to . in regular expressions. There are two forms: (_) matches any named node, and _ matches any named or anonymous node.

For example, this pattern would match any node inside a call:

(call (_) @call.inner)

Anchors

The anchor operator, ., is used to constrain the ways in which child patterns are matched. It has different behaviors depending on where it's placed inside a query.

When . is placed before the first child within a parent pattern, the child will only match when it is the first named node in the parent. For example, the below pattern matches a given array node at most once, assigning the @the-element capture to the first identifier node in the parent array:

(array . (identifier) @the-element)

Without this anchor, the pattern would match once for every identifier in the array, with @the-element bound to each matched identifier.

Similarly, an anchor placed after a pattern's last child will cause that child pattern to only match nodes that are the last named child of their parent. The below pattern matches only nodes that are the last named child within a block.

(block (_) @last-expression .)

Finally, an anchor between two child patterns will cause the patterns to only match nodes that are immediate siblings. The pattern below, given a long dotted name like a.b.c.d, will only match pairs of consecutive identifiers: (a, b), (b, c), and (c, d).

(dotted_name
  (identifier) @prev-id
  .
  (identifier) @next-id)

Without the anchor, non-consecutive pairs like (a, c) and (b, d) would also be matched.

The restrictions placed on a pattern by an anchor operator ignore anonymous nodes.

Predicates

You can also specify arbitrary metadata and conditions associated with a pattern by adding predicate S-expressions anywhere within your pattern. Predicate S-expressions start with a predicate name beginning with a # character. After that, they can contain an arbitrary number of @-prefixed capture names or strings.

For example, this pattern would match identifiers whose names are written in SCREAMING_SNAKE_CASE:

(
  (identifier) @constant
  (#match? @constant "^[A-Z][A-Z_]+")
)

And this pattern would match key-value pairs where the value is an identifier with the same name as the key:

(
  (pair
    key: (property_identifier) @key-name
    value: (identifier) @value-name)
  (#eq? @key-name @value-name)
)

Note - Predicates are not handled directly by the Tree-sitter C library. They are just exposed in a structured form so that higher-level code can perform the filtering. However, higher-level bindings to Tree-sitter like the Rust crate or the WebAssembly binding implement a few common predicates like #eq? and #match?.
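
For code that uses the C library directly, a predicate such as #eq? therefore has to be evaluated by the application. One plausible way to do that is to compare the source text of the two captured nodes, as in the minimal sketch below. The function capture_text_eq is a hypothetical helper, not part of the Tree-sitter API; the byte ranges are assumed to come from ts_node_start_byte and ts_node_end_byte on the captured nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: returns true if the two byte ranges of `source`
// contain identical text. Each range is [start, end), as reported by
// ts_node_start_byte and ts_node_end_byte for a captured node.
static bool capture_text_eq(
  const char *source,
  uint32_t a_start,
  uint32_t a_end,
  uint32_t b_start,
  uint32_t b_end
) {
  assert(source != NULL);
  assert(a_start <= a_end && b_start <= b_end);
  uint32_t a_length = a_end - a_start;
  uint32_t b_length = b_end - b_start;
  if (a_length != b_length) return false;
  return memcmp(source + a_start, source + b_start, a_length) == 0;
}
```

A match for the key-value pattern above would be kept only when this check succeeds for the @key-name and @value-name captures, and discarded otherwise.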

The Query API

Create a query by specifying a string containing one or more patterns:

TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);

If there is an error in the query, then the error_offset argument will be set to the byte offset of the error, and the error_type argument will be set to a value that indicates the type of error:

typedef enum {
  TSQueryErrorNone = 0,
  TSQueryErrorSyntax,
  TSQueryErrorNodeType,
  TSQueryErrorField,
  TSQueryErrorCapture,
} TSQueryError;

The TSQuery value is immutable and can be safely shared between threads. To execute the query, create a TSQueryCursor, which carries the state needed for processing the queries. The query cursor should not be shared between threads, but can be reused for many query executions.

TSQueryCursor *ts_query_cursor_new(void);

You can then execute the query on a given syntax node:

void ts_query_cursor_exec(TSQueryCursor *, const TSQuery *, TSNode);

You can then iterate over the matches:

typedef struct {
  TSNode node;
  uint32_t index;
} TSQueryCapture;

typedef struct {
  uint32_t id;
  uint16_t pattern_index;
  uint16_t capture_count;
  const TSQueryCapture *captures;
} TSQueryMatch;

bool ts_query_cursor_next_match(TSQueryCursor *, TSQueryMatch *match);

This function will return false when there are no more matches. Otherwise, it will populate the match with data about which pattern matched and which nodes were captured.

Static Node Types

In languages with static typing, it can be helpful for syntax trees to provide specific type information about individual syntax nodes. Tree-sitter makes this information available via a generated file called node-types.json. This node types file provides structured data about every possible syntax node in a grammar.

You can use this data to generate type declarations in statically-typed programming languages. For example, GitHub's Semantic uses these node types files to generate Haskell data types for every possible syntax node, which allows for code analysis algorithms to be structurally verified by the Haskell type system.

The node types file contains an array of objects, each of which describes a particular type of syntax node using the following entries:

Basic Info

Every object in this array has these two entries:

  • "type" - A string that indicates which grammar rule the node represents. This corresponds to the ts_node_type function described above.
  • "named" - A boolean that indicates whether this kind of node corresponds to a rule name in the grammar or just a string literal. See above for more info.

Examples:

{
  "type": "string_literal",
  "named": true
}

{
  "type": "+",
  "named": false
}

Together, these two fields constitute a unique identifier for a node type; no two top-level objects in the node-types.json should have the same values for both "type" and "named".

Internal Nodes

Many syntax nodes can have children. The node type object describes the possible children that a node can have using the following entries:

  • "fields" - An object that describes the possible fields that the node can have. The keys of this object are field names, and the values are child type objects, described below.
  • "children" - Another child type object that describes all of the node's possible named children without fields.

A child type object describes a set of child nodes using the following entries:

  • "required" - A boolean indicating whether there is always at least one node in this set.
  • "multiple" - A boolean indicating whether there can be multiple nodes in this set.
  • "types" - An array of objects that represent the possible types of nodes in this set. Each object has two keys: "type" and "named", whose meanings are described above.

Example with fields:

{
  "type": "method_definition",
  "named": true,
  "fields": {
    "body": {
      "multiple": false,
      "required": true,
      "types": [{ "type": "statement_block", "named": true }]
    },
    "decorator": {
      "multiple": true,
      "required": false,
      "types": [{ "type": "decorator", "named": true }]
    },
    "name": {
      "multiple": false,
      "required": true,
      "types": [
        { "type": "computed_property_name", "named": true },
        { "type": "property_identifier", "named": true }
      ]
    },
    "parameters": {
      "multiple": false,
      "required": true,
      "types": [{ "type": "formal_parameters", "named": true }]
    }
  }
}

Example with children:

{
  "type": "array",
  "named": true,
  "fields": {},
  "children": {
    "multiple": true,
    "required": false,
    "types": [
      { "type": "_expression", "named": true },
      { "type": "spread_element", "named": true }
    ]
  }
}

Supertype Nodes

In Tree-sitter grammars, there are usually certain rules that represent abstract categories of syntax nodes (e.g. "expression", "type", "declaration"). In the grammar.js file, these are often written as hidden rules whose definition is a simple choice where each member is just a single symbol.

Normally, hidden rules are not mentioned in the node types file, since they don't appear in the syntax tree. But if you add a hidden rule to the grammar's supertypes list, then it will show up in the node types file, with the following special entry:

  • "subtypes" - An array of objects that specify the types of nodes that this 'supertype' node can wrap.

Example:

{
  "type": "_declaration",
  "named": true,
  "subtypes": [
    { "type": "class_declaration", "named": true },
    { "type": "function_declaration", "named": true },
    { "type": "generator_function_declaration", "named": true },
    { "type": "lexical_declaration", "named": true },
    { "type": "variable_declaration", "named": true }
  ]
}

Supertype nodes will also appear elsewhere in the node types file, as children of other node types, in a way that corresponds with how the supertype rule was used in the grammar. This can make the node types much shorter and easier to read, because a single supertype will take the place of multiple subtypes.

Example:

{
  "type": "export_statement",
  "named": true,
  "fields": {
    "declaration": {
      "multiple": false,
      "required": false,
      "types": [{ "type": "_declaration", "named": true }]
    },
    "source": {
      "multiple": false,
      "required": false,
      "types": [{ "type": "string", "named": true }]
    }
  }
}