Tag Archives: grammar

ANTLR with C# – adding actions to rules

In the last post, we saw how to install ANTLR in Visual Studio and write a simple grammar. We could run the parser and get it to check some input for errors, but so far nothing happens when a correct string is input. In this post, we’ll see how to add some actions for each rule in the grammar.

Although the grammar rules are written in ANTLR’s special syntax, actions are written in the language for which ANTLR produces its code, which in this case is C#. We’ll begin with a quick review of our grammar file so far:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

/*
 * Parser Rules
 */

public addSubExpr
	: multDivExpr (( '+' | '-' ) multDivExpr)*;

multDivExpr
  : INT (( '*' | '/' ) INT)*;

/*
 * Lexer Rules
 */

INT : '0'..'9'+;
WS :  (' '|'\t'|'\r'|'\n')+ {Skip();} ;

The grammar accepts a single line of input consisting of the four arithmetic operations acting on non-negative integers. No variables or parentheses are allowed. As such, there are four operations for which we need to add actions. Well, actually there is one more action, since we need to get the program to convert a string representation of an integer to C#’s int data type, and we’re allowing a single integer as valid input.

Code for an action is inserted directly into the rules at the point where we want it to be run. The easiest way to see what’s going on is to have a look at the modified grammar file with actions included. Here it is:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

@header {
using System;
}
/*
 * Parser Rules
 */

public addSubExpr returns [int value]
	: a = multDivExpr {$value = a;}
  ( '+' b = multDivExpr {$value += b;}
  | '-' b = multDivExpr {$value -= b;})*;

multDivExpr returns [int value]
  : a = INT {$value = Int32.Parse($a.text);}
  ( '*' b = INT {$value *= Int32.Parse($b.text);}
  | '/' b = INT {$value /= Int32.Parse($b.text);})*;

/*
 * Lexer Rules
 */

INT : '0'..'9'+;
WS :  (' '|'\t'|'\r'|'\n')+ {Skip();} ;

We’re going to have the grammar return the result of the calculation so it can be used elsewhere in our master program.

First, notice the @header statement on line 10. You place any intitalization statements that the C# code will need in here. Since we’ll need C#’s System.Int32.Parse() method to convert from a string to an int, we put a ‘using System’ statement in the header. We could, of course, just write all calls to Parse() using its full path, but as with regular C# code, the using statement saves a lot of typing.

Now have a look at the parser and lexer rules themselves. The lexer rules haven’t changed at all, since all the lexer does is extract tokens from the input and feed them to the parser, and the parser is where all the work is done on interpreting the input.

Next, look at the multDivExpr rule on line 22. This rule should calculate the product or quotient of two integers and return the result. We’ve added a ‘returns [int value]’ phrase to this rule. This is ANTLR’s syntax for defining the return value from a rule. (Actually, the latest version of ANTLR allows more than one value to be returned (it encapsulates all return values into a new class), but we won’t need that here.) Parameter lists in ANTLR are enclosed in square brackets rather than parentheses as in C#. The ‘value’ parameter is a local variable whose scope is the entire rule in which it’s defined.

On line 23, we deal with the part of the rule which matches the first integer. We define another local variable, ‘a’, to hold the contents of the token INT. Remember that at this stage, INT is a string of digits, and not an int. Thus we need to add an action that converts INT to a C# int, and that’s what the code in braces on line 23 does. The $a.text is ANTLR syntax which extracts the string from the local variable ‘a’. When you need to access an ANTLR variable, you prefix it with $.

Line 24 deals with multiplication. We save the second INT in the variable ‘b’, and return the product of the current $value with the int extracted from ‘b’. The variable ‘a’ is still in scope on line 24, so we could have written for this line:

  ( '*' b = INT {$value = Int32.Parse($a.text) * Int32.Parse($b.text);}

However, using the current value of $value is easier.

Line 25 handles division in the same way.

The addSubExpr rule on line 17 works in pretty much the same way, although there are a couple of points to note.

First, we’ve declared addSubExpr as ‘public’. By default, the code ANTLR produces for each rule is private, so you won’t be able to access it from outside the parser class. Since addSubExpr is the top level rule in our grammar, it’s the one we’ll need to call from outside to parse the input, so we need to make sure it’s public.

Also, note on line 18 that we save the result of the first multDivExpr in a local variable ‘a’, and we assign ‘a’ directly to $value without using a $ in front of the ‘a’ or accessing any of its properties. We can do this because multDivExpr was defined as returning an int, and an int is just what we need in the calculation. In more complex cases, we might use some non-primitive data type as the return type, in which case we’d need to access one of its fields.

Lines 19 and 20 use the same technique for performing the addition and subtraction.

The code that calls this modified version of the parser is

using System;
using System.IO;
using Antlr.Runtime;
using Antlr.Runtime.Misc;

namespace AntlrTest01
{
  class Program
  {
    public static void Main(string[] args)
    {
      Stream inputStream = Console.OpenStandardInput();
      ANTLRInputStream input = new ANTLRInputStream(inputStream);
      CalculatorLexer lexer = new CalculatorLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      CalculatorParser parser = new CalculatorParser(tokens);
      int answer = parser.addSubExpr();
      Console.WriteLine("Answer = " + answer);
    }
  }
}

This is the same as in the previous post, except that this time we save the return value from the call to the parser’s addSubExpr() method and print it out as the answer.

In the next post, we’ll have a look at ANTLR’s other method defining grammars: the abstract syntax tree or AST.

Advertisements

ANTLR with C# – a simple grammar

This post describes how to get started with Antlr version 3. For version 4, see here (note that Antlr4 breaks most of the functionality in previous versions).

ANTLR (ANother Tool for Language Recognition) is a tool for defining and implementing your own computer languages. Most of us won’t ever create a fully-fledged language like C# or Java, but it is often useful and efficient to create little languages for specialist situations, rather than writing clumsy string-parsing code. ANTLR is an ideal tool for handling situations like that, although it can also tackle large problems.

ANTLR is written in Java, and most of the examples (and the only book I know of; there is a new edition of this book due out later this year, but from the blurb it seems it will still concentrate on Java) are given in Java. However, ANTLR is available for quite a few languages and since I use mainly C# I thought a post on getting started with ANTLR in C# and Visual Studio would be useful.

First, we’ll look at getting ANTLR installed in Visual Studio. Start by visiting the C# download page. Download the antlr-dotnet-csharpbootstrap file and the ANTLR language support listed under Visual Studio 2010 Extensions. You should also get the documentation using the link at the top of the download page, although this documentation is a bit sketchy.

Install the VS extension (it’s a self-installing file). I had a problem with the installation in that the Intellisense for grammar files wouldn’t work, and Visual Studio complained of an exception whenever a grammar file was loaded. Eventually I discovered this was due to one of the files in the extension installation not being put in the right place. If you get this problem, open up the extension installer file Tvl.VisualStudio.Language.Antlr3.vsix (it’s in standard zip format so WinZip or WinRar should open it) and copy the file Tvl.VisualStudio.Shell.OutputWindow.Interfaces.dll to C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\CommonExtensions\Microsoft\Editor. (The (x86) in the Program Files folder will be there if you’re running a 32-bit version of Visual Studio under 64-bit Windows 7. If you’re using 32-bit Windows 7, it’s just the Program Files folder.)

Setting up an ANTLR project in Visual Studio

There are a few steps that need to be performed each time you create an ANTLR project in VS2010. These steps are described in the documentation for a project called CoolTool, but I’ll run through them here for completeness.

First, create your new project (we’ll call it AntlrTest01) in VS. Then create a new folder called Reference in the solution’s directory, and unzip the contents of the bootstrap file you downloaded above into this folder. You’ll need to copy this folder to each new project unless you want to store it in a central location and refer to it in the next step, but be aware that if you want to distribute your program, these files will need to go along with it, so it’s often handier to just have a copy of them with your project.

Back in VS, unload your project by right-clicking on its name in Solution Explorer and selecting ‘Unload project’.  Then edit the csproj file by right-clicking on the project name again and selecting ‘Edit AntlrTest.csproj’. Scroll down to the bottom of this file (it’s in XML, so moderately readable) and find the line

  <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />

Just after this line, add the lines:

  <PropertyGroup>
    <!-- Folder containing AntlrBuildTask.dll -->
    <AntlrBuildTaskPath>$(ProjectDir)..\Reference\Antlr</AntlrBuildTaskPath>
    <!-- Path to the ANTLR Tool itself. -->
    <AntlrToolPath>$(ProjectDir)..\Reference\Antlr\Antlr3.exe</AntlrToolPath>
  </PropertyGroup>
  <Import Project="$(ProjectDir)..\Reference\Antlr\Antlr3.targets" />

This points VS at the ANTLR files you unzipped earlier. You can see the files’ location is given relative to the ProjectDir, so if you want to install your files somewhere else, you’ll need to change path in the csproj file.

When you’re done, save the csproj file and then right click on the project name and select ‘Reload project’.

One final task is needed before we can start coding. You’ll need to add a reference to the ANTLR dll in your project. In Solution Explorer, right click on References and select Add Reference, then click the Browse tab. The file you want is Antlr3.Runtime.dll, and it’s in the Antlr folder which is inside the References folder that you unzipped the file into earlier. You’re now ready to write your ANTLR project.

A simple grammar

An ANTLR grammar is defined by a grammar, which is a description of the allowable words and symbols and of the rules that combine these words and symbols. A grammar in ANTLR is written using a special syntax (that is, not in C#) so you’ll need to get familiar with at least some of that syntax to write a simple grammar.

There are actually two ways you can get ANTLR to implement a language. The first is by writing the C# code that is to be executed as part of each grammar rule. Although this is the simplest way, it’s not very efficient since it requires that the input string (the code in which language statements are written) be parsed from scratch each time it is to be interpreted. A more efficient way is to convert the input code into an abstract syntax tree or AST, and then use the AST for future operations. We’ll look only at the first method in this post.

To start with, we’ll write a grammar that implements a simple four-function calculator. We won’t add any frills, so we won’t include things like variable names or even parentheses.

Begin by adding a combined lexer and parser to your project. Right click on the project name and select Add New Item. If you’ve installed the VS extension above properly, you should see four ANTLR items in the list of Visual C# Items. Select ANTLR Combined Parser, change its name to Calculator.g3 and click Add. You’ll see that three files have been added to your project: the grammar file Calculator.g3 and two C# source files, one for the lexer and one for the parser.

By default, the grammar file generated in VS produces the AST version of the parser, which isn’t what we want in this simple example. To change it, edit the ‘options’ section so that the start of the file looks like this:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

ANTLR generates two classes (one for the lexer and one for the parser) from the grammar file, and the last two lines above tell it to put these classes in the AntlrTest01 namespace (the same namespace that contains the rest of the project).

We’ve mentioned the words ‘lexer’ and ‘parser’ a few times without defining them. These two beasts are fundamental to creating a language generator. Basically a lexer reads the input stream (usually a string) and breaks it up into tokens. It is the job of the lexer to identify which are valid tokens and which aren’t, and to generate error messages if it finds invalid input at the token level. For example, in our simple grammar we will accept integers and the four operators +, -, * and / as valid tokens. Anything else should produce an error.

The lexer feeds the tokens to the parser, which attempts to match the tokens to the rules we will define in the grammar file. We will need to add code to the parser so that it performs some function when some tokens match a valid rule (for example, when the expression 3 + 4 is fed into the parser), or generates an error message if the tokens don’t match a rule.

The first step in writing a language is defining the grammar. At that stage, the only function of the lexer and parser is to check the input to see if it’s valid, without doing any actual calculations. The easiest way to see how this works is to look at the complete grammar file:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

/*
 * Parser Rules
 */

public addSubExpr
	: multDivExpr (( '+' | '-' ) multDivExpr)*;

multDivExpr
  : INT (( '*' | '/' ) INT)*;

/*
 * Lexer Rules
 */

INT : '0'..'9'+;
WS :  (' '|'\t'|'\r'|'\n')+ {Skip();} ;

Look at the lexer rules (at the bottom) first. We’ve defined two tokens. INT is the token for an integer, and consists of the characters 0 through 9 repeated one or more times. The + sign at the end is the usual regular expression syntax for ‘one or more occurrences’. WS stands for ‘whitespace’ and defines which characters are considered as blanks. Here we’ve include the blank itself, and the tab, return and newline characters. The Skip() function in braces at the end tells the lexer to skip over whitespace when it’s reading the input.

Now have a look at the parser. We’ve defined two rules: addSubExpr and multDivExpr. multDivExpr is defined as an INT followed by zero or more expressions of the form: * or / followed by another INT. (The * at the end is again regular expression syntax for ‘zero or more occurrences’.) Thus a valid multDivExpr would be a single integer like 42, or expressions like 42*12 or 42*12/6. Notice that since we’ve told the lexer to ignore whitespace, we can insert blanks between the tokens and it won’t affect the result.

Finally, look at the addSubExpr rule. It consists of a single multDivExpr, or a multDivExpr followed by zero or more expressions of the form + or – followed by another multDivExpr. So a valid addSubExpr would be, again, a single integer, or a single multDivExpr like those in the previous paragraph, or an expression like 42*12+34, or 42*12+34/2 or 42*12+34-22 and so on.

If you haven’t made any errors typing this in, you should now be able to build the project in VS without errors. However, we haven’t actually written any C# code yet, so if you run the program nothing will happen.

Before we add some C#, though, it’s worth looking at the Class View in VS. You’ll see that there are two classes called CalculatorLexer and CalculatorParser and if you look at their list of methods and properties, you’ll see a whole pile of entries. This is the code that ANTLR generates from your grammar file. Feel free to have a look at the code, but don’t modify any of it since (a) it will probably break it and (b) any changes you make will get overwritten the next time you change the grammar file.

Finally, how do we incorporate the grammar into a C# program? Here’s what is needed to use the lexer and parser in C#:

using System;
using System.IO;
using Antlr.Runtime;
using Antlr.Runtime.Misc;

namespace AntlrTest01
{
  class Program
  {
    public static void Main(string[] args)
    {
      Stream inputStream = Console.OpenStandardInput();
      ANTLRInputStream input = new ANTLRInputStream(inputStream);
      CalculatorLexer lexer = new CalculatorLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      CalculatorParser parser = new CalculatorParser(tokens);
      parser.addSubExpr();
    }
  }
}

Note the ‘using’ statements at the top that we need to get access to the ANTLR libraries.

First we open the standard input (via the console) as the input stream and use this to create an ANTLRInputStream. This stream is passed to the lexer, which produces a stream of tokens which is then passed to the parser. Finally we call addSubExpr(), which tries to apply the addSubExpr rule to the input.

We can run this program and then type some input into the console. When you’re done with the input, type Control-Z on a line by itself. Control-Z is the Windows character for ‘end of file’ and tells the parser the input is complete.

What you’ll find is that, no matter what you type in (even it’s incorrect), you’ll get no output. This is because we haven’t provided either code for handling correct input or error messages for incorrect input. We’ll leave the former for the next post, but we can provide some rudimentary error handling to finish off this post.

When a parser or lexer method detects invalid input, it calls the method ReportError(). This method is provided only as a virtual stub in the generated code, so if you want some output when an error is detected, you’ll need to override it. We can place the following method in Calculator.g3.lexer.cs:

using Antlr.Runtime;
using System;

namespace AntlrTest01
{
  partial class CalculatorLexer
  {
    public override void ReportError(RecognitionException e)
    {
      base.ReportError(e);
      Console.WriteLine("Error in lexer at line " + e.Line + ":" + e.CharPositionInLine);
    }
  }
}

And a similar method override can be placed in Calculator.g3.Parser.cs:

using Antlr.Runtime;
using System;

namespace AntlrTest01
{
  partial class CalculatorParser
  {
    public override void ReportError(RecognitionException e)
    {
      base.ReportError(e);
      Console.WriteLine("Error in parser at line " + e.Line + ":" + e.CharPositionInLine);
    }
  }
}

Now if you enter some incorrect input like 3+d you’ll get an error message:

3+d
^Z
Error in lexer at line 1:2
Error in parser at line 2:0

The lexer complains about character 2 (the location is zero-based so that’s the third character), since ‘d’ isn’t a valid token (remember we are accepting only integers, whitespace and the four operators). The parser complains about an error at the start of line 2, since the lexer has fed it only the two tokens 3 and +, which doesn’t match either of the parser’s rules.

Hopefully that’s enough to get started with ANTLR. Next time round we’ll look at adding some code to get the parser to actually do the calculations.