ANTLR with C# – a simple grammar

This post describes how to get started with Antlr version 3. For version 4, see here (note that Antlr4 breaks most of the functionality in previous versions).

ANTLR (ANother Tool for Language Recognition) is a tool for defining and implementing your own computer languages. Most of us won’t ever create a fully-fledged language like C# or Java, but it is often useful and efficient to create little languages for specialist situations, rather than writing clumsy string-parsing code. ANTLR is an ideal tool for handling situations like that, although it can also tackle large problems.

ANTLR is written in Java, and most of the examples (and the only book I know of; there is a new edition of this book due out later this year, but from the blurb it seems it will still concentrate on Java) are given in Java. However, ANTLR is available for quite a few languages and since I use mainly C# I thought a post on getting started with ANTLR in C# and Visual Studio would be useful.

First, we’ll look at getting ANTLR installed in Visual Studio. Start by visiting the C# download page. Download the antlr-dotnet-csharpbootstrap file and the ANTLR language support listed under Visual Studio 2010 Extensions. You should also get the documentation using the link at the top of the download page, although this documentation is a bit sketchy.

Install the VS extension (it’s a self-installing file). I had a problem with the installation in that the Intellisense for grammar files wouldn’t work, and Visual Studio complained of an exception whenever a grammar file was loaded. Eventually I discovered this was due to one of the files in the extension installation not being put in the right place. If you get this problem, open up the extension installer file Tvl.VisualStudio.Language.Antlr3.vsix (it’s in standard zip format so WinZip or WinRar should open it) and copy the file Tvl.VisualStudio.Shell.OutputWindow.Interfaces.dll to C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\CommonExtensions\Microsoft\Editor. (The (x86) in the Program Files folder will be there if you’re running a 32-bit version of Visual Studio under 64-bit Windows 7. If you’re using 32-bit Windows 7, it’s just the Program Files folder.)

Setting up an ANTLR project in Visual Studio

There are a few steps that need to be performed each time you create an ANTLR project in VS2010. These steps are described in the documentation for a project called CoolTool, but I’ll run through them here for completeness.

First, create your new project (we’ll call it AntlrTest01) in VS. Then create a new folder called Reference in the solution’s directory, and unzip the contents of the bootstrap file you downloaded above into this folder. You’ll need to copy this folder to each new project unless you want to store it in a central location and refer to it in the next step, but be aware that if you want to distribute your program, these files will need to go along with it, so it’s often handier to just have a copy of them with your project.

Back in VS, unload your project by right-clicking on its name in Solution Explorer and selecting ‘Unload project’.  Then edit the csproj file by right-clicking on the project name again and selecting ‘Edit AntlrTest.csproj’. Scroll down to the bottom of this file (it’s in XML, so moderately readable) and find the line

  <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />

Just after this line, add the lines:

  <PropertyGroup>
    <!-- Folder containing AntlrBuildTask.dll -->
    <AntlrBuildTaskPath>$(ProjectDir)..\Reference\Antlr</AntlrBuildTaskPath>
    <!-- Path to the ANTLR Tool itself. -->
    <AntlrToolPath>$(ProjectDir)..\Reference\Antlr\Antlr3.exe</AntlrToolPath>
  </PropertyGroup>
  <Import Project="$(ProjectDir)..\Reference\Antlr\Antlr3.targets" />

This points VS at the ANTLR files you unzipped earlier. You can see the files’ location is given relative to the ProjectDir, so if you want to install your files somewhere else, you’ll need to change path in the csproj file.

When you’re done, save the csproj file and then right click on the project name and select ‘Reload project’.

One final task is needed before we can start coding. You’ll need to add a reference to the ANTLR dll in your project. In Solution Explorer, right click on References and select Add Reference, then click the Browse tab. The file you want is Antlr3.Runtime.dll, and it’s in the Antlr folder which is inside the References folder that you unzipped the file into earlier. You’re now ready to write your ANTLR project.

A simple grammar

An ANTLR grammar is defined by a grammar, which is a description of the allowable words and symbols and of the rules that combine these words and symbols. A grammar in ANTLR is written using a special syntax (that is, not in C#) so you’ll need to get familiar with at least some of that syntax to write a simple grammar.

There are actually two ways you can get ANTLR to implement a language. The first is by writing the C# code that is to be executed as part of each grammar rule. Although this is the simplest way, it’s not very efficient since it requires that the input string (the code in which language statements are written) be parsed from scratch each time it is to be interpreted. A more efficient way is to convert the input code into an abstract syntax tree or AST, and then use the AST for future operations. We’ll look only at the first method in this post.

To start with, we’ll write a grammar that implements a simple four-function calculator. We won’t add any frills, so we won’t include things like variable names or even parentheses.

Begin by adding a combined lexer and parser to your project. Right click on the project name and select Add New Item. If you’ve installed the VS extension above properly, you should see four ANTLR items in the list of Visual C# Items. Select ANTLR Combined Parser, change its name to Calculator.g3 and click Add. You’ll see that three files have been added to your project: the grammar file Calculator.g3 and two C# source files, one for the lexer and one for the parser.

By default, the grammar file generated in VS produces the AST version of the parser, which isn’t what we want in this simple example. To change it, edit the ‘options’ section so that the start of the file looks like this:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

ANTLR generates two classes (one for the lexer and one for the parser) from the grammar file, and the last two lines above tell it to put these classes in the AntlrTest01 namespace (the same namespace that contains the rest of the project).

We’ve mentioned the words ‘lexer’ and ‘parser’ a few times without defining them. These two beasts are fundamental to creating a language generator. Basically a lexer reads the input stream (usually a string) and breaks it up into tokens. It is the job of the lexer to identify which are valid tokens and which aren’t, and to generate error messages if it finds invalid input at the token level. For example, in our simple grammar we will accept integers and the four operators +, -, * and / as valid tokens. Anything else should produce an error.

The lexer feeds the tokens to the parser, which attempts to match the tokens to the rules we will define in the grammar file. We will need to add code to the parser so that it performs some function when some tokens match a valid rule (for example, when the expression 3 + 4 is fed into the parser), or generates an error message if the tokens don’t match a rule.

The first step in writing a language is defining the grammar. At that stage, the only function of the lexer and parser is to check the input to see if it’s valid, without doing any actual calculations. The easiest way to see how this works is to look at the complete grammar file:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

/*
 * Parser Rules
 */

public addSubExpr
	: multDivExpr (( '+' | '-' ) multDivExpr)*;

multDivExpr
  : INT (( '*' | '/' ) INT)*;

/*
 * Lexer Rules
 */

INT : '0'..'9'+;
WS :  (' '|'\t'|'\r'|'\n')+ {Skip();} ;

Look at the lexer rules (at the bottom) first. We’ve defined two tokens. INT is the token for an integer, and consists of the characters 0 through 9 repeated one or more times. The + sign at the end is the usual regular expression syntax for ‘one or more occurrences’. WS stands for ‘whitespace’ and defines which characters are considered as blanks. Here we’ve include the blank itself, and the tab, return and newline characters. The Skip() function in braces at the end tells the lexer to skip over whitespace when it’s reading the input.

Now have a look at the parser. We’ve defined two rules: addSubExpr and multDivExpr. multDivExpr is defined as an INT followed by zero or more expressions of the form: * or / followed by another INT. (The * at the end is again regular expression syntax for ‘zero or more occurrences’.) Thus a valid multDivExpr would be a single integer like 42, or expressions like 42*12 or 42*12/6. Notice that since we’ve told the lexer to ignore whitespace, we can insert blanks between the tokens and it won’t affect the result.

Finally, look at the addSubExpr rule. It consists of a single multDivExpr, or a multDivExpr followed by zero or more expressions of the form + or – followed by another multDivExpr. So a valid addSubExpr would be, again, a single integer, or a single multDivExpr like those in the previous paragraph, or an expression like 42*12+34, or 42*12+34/2 or 42*12+34-22 and so on.

If you haven’t made any errors typing this in, you should now be able to build the project in VS without errors. However, we haven’t actually written any C# code yet, so if you run the program nothing will happen.

Before we add some C#, though, it’s worth looking at the Class View in VS. You’ll see that there are two classes called CalculatorLexer and CalculatorParser and if you look at their list of methods and properties, you’ll see a whole pile of entries. This is the code that ANTLR generates from your grammar file. Feel free to have a look at the code, but don’t modify any of it since (a) it will probably break it and (b) any changes you make will get overwritten the next time you change the grammar file.

Finally, how do we incorporate the grammar into a C# program? Here’s what is needed to use the lexer and parser in C#:

using System;
using System.IO;
using Antlr.Runtime;
using Antlr.Runtime.Misc;

namespace AntlrTest01
{
  class Program
  {
    public static void Main(string[] args)
    {
      Stream inputStream = Console.OpenStandardInput();
      ANTLRInputStream input = new ANTLRInputStream(inputStream);
      CalculatorLexer lexer = new CalculatorLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      CalculatorParser parser = new CalculatorParser(tokens);
      parser.addSubExpr();
    }
  }
}

Note the ‘using’ statements at the top that we need to get access to the ANTLR libraries.

First we open the standard input (via the console) as the input stream and use this to create an ANTLRInputStream. This stream is passed to the lexer, which produces a stream of tokens which is then passed to the parser. Finally we call addSubExpr(), which tries to apply the addSubExpr rule to the input.

We can run this program and then type some input into the console. When you’re done with the input, type Control-Z on a line by itself. Control-Z is the Windows character for ‘end of file’ and tells the parser the input is complete.

What you’ll find is that, no matter what you type in (even it’s incorrect), you’ll get no output. This is because we haven’t provided either code for handling correct input or error messages for incorrect input. We’ll leave the former for the next post, but we can provide some rudimentary error handling to finish off this post.

When a parser or lexer method detects invalid input, it calls the method ReportError(). This method is provided only as a virtual stub in the generated code, so if you want some output when an error is detected, you’ll need to override it. We can place the following method in Calculator.g3.lexer.cs:

using Antlr.Runtime;
using System;

namespace AntlrTest01
{
  partial class CalculatorLexer
  {
    public override void ReportError(RecognitionException e)
    {
      base.ReportError(e);
      Console.WriteLine("Error in lexer at line " + e.Line + ":" + e.CharPositionInLine);
    }
  }
}

And a similar method override can be placed in Calculator.g3.Parser.cs:

using Antlr.Runtime;
using System;

namespace AntlrTest01
{
  partial class CalculatorParser
  {
    public override void ReportError(RecognitionException e)
    {
      base.ReportError(e);
      Console.WriteLine("Error in parser at line " + e.Line + ":" + e.CharPositionInLine);
    }
  }
}

Now if you enter some incorrect input like 3+d you’ll get an error message:

3+d
^Z
Error in lexer at line 1:2
Error in parser at line 2:0

The lexer complains about character 2 (the location is zero-based so that’s the third character), since ‘d’ isn’t a valid token (remember we are accepting only integers, whitespace and the four operators). The parser complains about an error at the start of line 2, since the lexer has fed it only the two tokens 3 and +, which doesn’t match either of the parser’s rules.

Hopefully that’s enough to get started with ANTLR. Next time round we’ll look at adding some code to get the parser to actually do the calculations.

About these ads
Post a comment or leave a trackback: Trackback URL.

Comments

  • Selamet Hariadi  On April 29, 2013 at 2:46 PM

    how to show all error with input java.g ?

  • lenwhite.com  On August 7, 2013 at 9:17 PM

    Love the post.
    However, I don’t understand how we can call this:
    CalculatorLexer lexer = new CalculatorLexer(input);
    When CalculatorLexer has no constructor that takes an argument…?
    I get a compile error.

  • Anonymous  On November 16, 2013 at 2:26 PM

    If you are looking for the auto-generated Parser and Lexer files, you will find them in the Debug/obj/ directory. They do not show up in Visual Studio.
    Instead in VS you will find two classes Calculator.g3.lexer and Calculator.g3.parser which are PARTIAL CLASSES and are merged with the auto-generated ones mentioned above.

  • Anonymous  On December 3, 2013 at 8:12 PM

    Hi,
    this is a great post!
    Nevertheless, It always shows that error message. Also when I put a valid rule lie 3 + 4 or 3 *4+12 ore anything else. I use exactly the code you posted.
    Cheers,
    Magnus

Trackbacks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 44 other followers

%d bloggers like this: