Monthly Archives: July 2012

ANTLR with C# – using an Abstract Syntax Tree (AST)

The ANTLR example we’ve seen in the last two posts (part 1 and part 2) produced a simple calculator that accepted integers and the four arithmetic operators and calculated the answer. This sort of parser is fine if you need to parse the input once only. However, suppose we wanted to write an application that allowed the user to enter a mathematical function f(x) as an algebraic expression and then draw a graph of that function between two values of x. In that case, we’d need to generate a number of values of f(x) for various values of x between the endpoints. One way of doing this is to parse the input string repeatedly, each time passing in the required value of x. However, this is rather inefficient, and ANTLR offers a better alternative: generating an abstract syntax tree or AST, and then using the AST to evaluate the input expression.

So what is an AST? To understand this, we need to understand what a parser does when it parses an input string. For example, with our simple calculator, we can give the input string 3+4*5. Since multiplication has a higher precedence than addition, 4*5 is done first. The parser treats the input as a tree like this:

To evaluate the tree, it is traversed or ‘walked’ in a depth first manner, which means that we start at the root ‘+’ and go as deep into the tree as we can along each child, then evaluate the operations as we come back out of the tree. Thus going down the left branch we encounter 3, which is saved until the result of the right branch is found. Going down the right branch as far as possible we encounter * then 4. Backing up from 4 to * we then go down the right branch and find 5. Both children of * are now determined so we can apply the * operator to get 20. This determines the right branch of the +, so we can now apply that operator and get 3+20=23.

In the more general situation mentioned above, where we need to evaluate the input several times, perhaps with different values at some of the nodes, it would clearly be less work if we didn’t have to generate the tree afresh each time before we traverse it. ANTLR allows us to generate an AST from a grammar file and then define a tree grammar which uses the AST as input, rather than the original string. Essentially what we need to do is write a grammar in the usual way and then write a second grammar that operates on the tree nodes rather than the original grammar rules. This might sound like more work (OK, it is), but usually writing the tree grammar is easier than writing the original grammar, and we are rewarded with a more efficient program.

To illustrate how this is done, we’ll expand our calculator grammar to an algebraic function evaluator. To this end, we want the numbers it uses to be doubles instead of ints. We also want to support a few more operations, such as raising to a power (using the ^ operator) and the unary minus for negating an expression. The user can enter an algebraic expression using x as the variable, and then specify the range of x values over which f(x) is to be calculated. Here’s the complete grammar file which implements this, and which generates the AST. We’ll explain the new syntax required for AST generation afterwards.

```grammar Polynomial;

options {
language=CSharp3;
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}

tokens {
UNARYMINUS;
}

@lexer::namespace{Polynomial}
@parser::namespace{Polynomial}

// START:expr
public startProg
: expr { System.Console.WriteLine(\$expr.tree.ToStringTree());};

expr
:  multExpr
( '+'^ multExpr
| '-'^ multExpr
)*
;

multExpr
:   powerExpr
('*'^ powerExpr
| '/'^ powerExpr
)*
;

powerExpr
:   unaryExpr ('^'^ unaryExpr)?
;

unaryExpr
:   '-' atom -> ^(UNARYMINUS atom)
|   atom
;

atom
:   DOUBLE
|   ID
|   '('! expr ')'!
;
// END:expr

// START:tokens
ID  :  'x' ;
DOUBLE :   '-'? '0'..'9'+ ('.' '0'..'9'*)?;
// Use uppercase Skip() for C#
WS  :   (' '|'\t'|'\r'|'\n')+ {Skip();} ;
// END:tokens
```

(If you’re wondering about the grammar name, I originally designed the grammar for calculating polynomials, but it grew in the telling.)

Note that we’ve added 3 lines to the options section on lines 5 to 7. Actually, the options are those generated by default by the Visual Studio plugin, so you can just leave them as they are if you’re starting a new grammar file.

On line 10 we have a ‘tokens’ section, in which we define a single token, UNARYMINUS. Since the – sign is used for two purposes (subtraction and negation), the parser gets confused at the tree stage, so I’ve used a token as a label for the unary minus. We’ll see how this works when we define the tree grammar in a minute.

The remainder of the grammar should be fairly self-explanatory if you’ve read the earlier posts, except for the bits that generate the AST. This bit does require some careful thought as to what nodes you want to be in the AST and how they should be structured.

We’ve inserted a startProg rule at line 18. This isn’t really needed here, but when you’re starting out with ASTs it’s useful to see that actual tree that is produced, so that’s what this rule does in addition to calling the expr rule which is where the actual parsing takes place.

Now look at the ‘expr’ rule on line 21. We’ve defined it as a multExpr on its own, or as two multExprs joined by + or -. The solo multExpr on line 22 is unadorned since we want this node to be placed in the AST as it is.

The + rule on line 23 however has 3 parts to it (the left and right multExpr operands and the + operator). Comparing with the diagram above, we see that we want this expression to be placed in the tree with the + as the root and the two multExprs as its children. This is indicated on line 23 by placing a ^ after the term that is to be the root of the node, in this case, the + sign.

The same technique is used in creating the other AST nodes in the expr, multExpr and powerExpr rules. The unaryExpr on line 39 is a bit different. In order to distinguish this use of ‘-’ from the subtraction rule in expr, we want the node in the AST to use the UNARYMINUS token as the root node rather than a bare ‘-’ symbol. To do this we’ve used a rewrite rule. This is defined by using the arrow -> followed by the form we want the node in the AST to have for this rule. Nodes in the AST always have the form (root child1 child2…), that is, the first node is the root and the other nodes are its children. Thus here UNARYMINUS is the root and the atom is its single child.

Finally, look at the atom rule on line 44. An atom consists of a DOUBLE, which matches a double floating point number, or an ID, which here we’ve restricted to the single variable name ‘x’, or an expr in parentheses. The parentheses serve only to fence off an expression from other terms on either side, and once the expr has been identified, the parenthses are no longer needed. We therefore don’t need them in the AST. To exclude a term from the AST, place a ! after it, as we’ve done here.

Now we can look at the tree grammar. To create a tree grammar file, you can use Visual Studio’s Add New Item dialog and select ANTLR Tree Grammar. Here’s the complete file:

```tree grammar PolynomialTree;

options {
language=CSharp3;
ASTLabelType=CommonTree;
tokenVocab=Polynomial;
}

@namespace{Polynomial}
using System;
}

// START:node
public start[double x] returns [double value]
: a = node[x] {\$value = a;};

node [double x] returns [double value]
:   ^('*' a = node[x] b = node[x]) {\$value = a * b;}
|   ^('/' a = node[x] b = node[x]) {\$value = a / b;}
|   ^('+' a = node[x] b = node[x]) {\$value = a + b;}
|   ^('-' a = node[x] b = node[x]) {\$value = a - b;}
|   ^('^' a = node[x] b = node[x]) {\$value = Math.Pow(a, b);}
|   ^(UNARYMINUS a = node[x]) {\$value = -a;}
|   ID {\$value = x;}
|   DOUBLE {\$value = Double.Parse(\$DOUBLE.text);}
;

// END:node
```

Notice this is defined as a ‘tree grammar’ (not just a ‘grammar’) on line 1. In the options on line 6 we’ve specified a ‘tokenVocab’. When ANTLR processes the original grammar file it produces a file containing the tokens used in that file, and to ensure that the tree grammar uses the same tokens, we load in that file. The tokens file has the same name as the grammar with the suffix ‘.tokens’. If you want to see it, it’s located under your project folder in obj\x86\Debug.

The start rule on line 15 is the entry point into the tree. Note that rule names in the tree grammar don’t have to match those in the original grammar, since you’re effectively defining a new grammar with tree nodes as input instead of a string. You might want to make the names the same, but I’ve kept them different here to show that they are in fact separate entities.

Since we want to walk the tree with various values for x, we need to define the start rule so that it accepts a parameter. This is done by enclosing the parameter in square brackets (NOT parentheses, as you’d do in a normal C# method call). The start rule also returns the result of the calculation as ‘value’. Its only action (on line 16) is to call a ‘node’ and pass x along to that node. Again, note that square brackets are used to call a rule with a parameter.

The meat of the tree grammar is in the ‘node’ rule on line 18. We list all the node types that can occur and define an action for each. We must begin each compound node (one that contains more than one term) with a ^ and enclose the node in parentheses; apart from that, it’s much the same as a rule in the original grammar.

On line 24, we make use of the UNARYMINUS token to recognize a unary minus operator. On line 25, the value of the parameter x is assigned whenever an ID is encountered in the tree, and on line 26 we parse a double floating point number.

We don’t have any references to the various rules like expr, multExpr and so on that were in the original grammar, since the original grammar took care of all that and built an AST where the nodes had a much more uniform structure. The precedence of the various operators is built into the AST (remember that lower down nodes are processed first in a depth-first traversal), so we don’t need to specify that either.

Finally, we need some C# code to use the AST to do some real calculations. Here’s the program:

```using Antlr.Runtime;
using Antlr.Runtime.Tree;
using System;
using System.IO;
using System.Text;

namespace Polynomial
{
class Program
{
static void Main(string[] args)
{
Console.Write("Enter expression: ");
Console.Write("Enter low value of x: ");
Console.Write("Enter high value of x: ");
Stream exprStream = new MemoryStream(ASCIIEncoding.Default.GetBytes(expression));

// Use the parser to build the AST
ANTLRInputStream input = new ANTLRInputStream(exprStream);
PolynomialLexer lexer = new PolynomialLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PolynomialParser parser = new PolynomialParser(tokens);
var result = parser.startProg();

// Use the tree to do the evaluation
CommonTree tree = (CommonTree)result.Tree;
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
PolynomialTree walker = new PolynomialTree(nodes);
double xStep = (xHigh - xLow) / 10.0;
for (double x = xLow; x <= xHigh; x += xStep)
{
nodes.Reset();
double value = walker.start(x);
Console.WriteLine("f(" + x + ") = " + value);
}
}
}
}
```

On lines 13 to 18 we read in the data from the user. The parser needs to read its input from an ANTLRInputStream, and that requires a C# Stream object, so we need to convert the string containing the algebraic expression into a Stream, which is done on line 19.

Next we need to call the parser based on the original grammar to build the AST. This is done on lines 22 to 26. These steps are pretty much the same as in the original use of the parser from the previous post. The difference is that the ‘result’ returned from the parser on line 26 is an AST rather than the result of a calculation. (Its actual data type is pretty horrendous, so we’ve used a ‘var’ to declare it.) This call to startProg() will print out the AST.

Once we’ve got the AST, we build the tree parser on lines 28 to 31. The CommonTreeNodeStream on line 30 acts as an input stream for the tree parser, and the PolynomialTree object ‘walker’ on line 31 is the parser that will do the evaluation.

The loop on line 33 calls the walker for each value of x and prints out the result. Note that we need reset the ‘nodes’ stream before each call since after the walker processes this stream, the marker in the stream is at the end of the input.

Here’s a typical run of the program:

```Enter expression: x^2 - (-3.14 + (4.5 - 9.03/x))*(5.43 - x)
Enter low value of x: 10
Enter high value of x: 30
(- (^ x 2) (* (+ -3.14 (- 4.5 (/ 9.03 x))) (- 5.43 x)))
f(10) = 102.08849
f(12) = 147.991275
f(14) = 202.12755
f(16) = 264.40975625
f(18) = 334.78925
f(20) = 413.236845
f(22) = 499.733968181818
f(24) = 594.2682375
f(26) = 696.831080769231
f(28) = 807.416375
f(30) = 926.01963
```

There’s no error checking, so if the user makes a mistake entering the expression or enters a value of x that causes a math error (like division by zero), the program will just crash, but error handling is a whole new ball game (and is probably harder than writing the original grammar), so we’ll leave it here for now.