Antlr 4 with C# and Visual Studio 2012

It’s been more than a year since I posted anything to Programming Pages, so I figure I should rectify that (been busy mainly with my alter-ego, Physics Pages).

Recently, I had another look at the parsing package Antlr and discovered that a whole new version has come out (Antlr 4), along with a book, The Definitive ANTLR 4 Reference, written by Antlr’s author, Terence Parr. However, as with Antlr 3, the book is written exclusively in Java, so a fair bit of detective work is needed to discover how to use it with C# in Visual Studio. The code required both to specify the lexer and parser, and to write the supporting C#, has pretty well completely changed from Antlr 3, so a fresh tutorial is needed.

Installation in Visual Studio

Support for Antlr4 exists for Visual Studio 2012 (and presumably 2010, although I no longer have this installed so I can’t test it), but not, it seems, for Visual Studio 2013. To get the files you need, visit the Antlr download page and get the ‘ANTLR 4 C#.dll’ zip file. Next, click on the ‘C# releases’ link. On the page that loads, scroll down to ‘Visual Studio 2010 Extensions’ and click on the ‘ANTLR Language Support’ link. ‘Download’ the Visual Studio plugin from that page and install it, making sure you’ve selected the correct version of Visual Studio in the dialog.

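To incorporate Antlr 4 into a VS project, create a new project in VS. In the directory that contains the project directory (normally the solution directory; the paths in the project file below start with $(ProjectDir)..\, which points one level above the project), create a folder named Reference and within the Reference folder create another folder called Antlr4. Unzip the entire contents of the ‘ANTLR 4 C#.dll’ zip file into the Antlr4 folder. You’ll need to do this for each new solution you create.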

Then, in Solution Explorer, right-click on the project and select ‘Unload project’. Then right-click on the project name (it’ll say ‘unavailable’ after it, but that’s OK) and select ‘Edit <Your project name>.csproj’. Scroll to near the end of the file, where you should find the line:

  <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />

Insert the following lines directly after this line:

  <PropertyGroup>
    <!-- Folder containing Antlr4BuildTasks.dll -->
    <Antlr4BuildTaskPath>$(ProjectDir)..\Reference\Antlr4</Antlr4BuildTaskPath>
    <!-- Path to the ANTLR Tool itself. -->
    <Antlr4ToolPath>$(ProjectDir)..\Reference\Antlr4\antlr4-csharp-4.0.1-SNAPSHOT-complete.jar</Antlr4ToolPath>
  </PropertyGroup>
  <Import Project="$(ProjectDir)..\Reference\Antlr4\Antlr4.targets" />

Then right-click on the project name and select ‘Reload project’.

Finally, you’ll need to add a reference to the Antlr4 runtime in your project to provide access to the API. Right-click on References, select ‘Add Reference’, and browse to the correct version of the runtime dll in the Antlr4 folder you created above. The version should match the version of .NET that you’re using in your project, so if you’re using .NET 4.5, choose the file ‘Antlr4.Runtime.v4.5.dll’. Now your project should be all set.

One final note: you’ll need Java to be installed in order to run Antlr4! This is true even if you’re coding in C#, since the Antlr4 tool itself is written in Java, and the build runs it to generate the C# code for your lexer and parser.

Writing an Antlr 4 grammar

You should now be in a position to start work on the actual code. If you installed the VS extension above, you can add an Antlr4 grammar to your project by right-clicking on the project name and selecting Add –> New Item. In the dialog you should see 7 Antlr items; 3 of these are for Antlr 4 files, with the other 4 being for Antlr 3. Since we’ll be working with both a lexer and a parser, select ‘ANTLR 4 Combined Grammar’, change the name of the file to whatever you like, and click ‘Add’.

As an illustration, we’ll generate a grammar for the good old four-function calculator, so we’ll call the project Calculator. In Solution Explorer you should find a file called Calculator.g4; this is where you write your parser and lexer rules.  The initial code in this file looks like this:

grammar Calculator;

@parser::members
{
	protected const int EOF = Eof;
}

@lexer::members
{
	protected const int EOF = Eof;
	protected const int HIDDEN = Hidden;
}

/*
 * Parser Rules
 */

compileUnit
	:	EOF
	;

/*
 * Lexer Rules
 */

WS
	:	' ' -> channel(HIDDEN)
	;

Don’t worry about the @parser::members and @lexer::members blocks at the top. What we’re interested in are the parser and lexer rules. The syntax for these has changed significantly from Antlr 3, to the extent that any grammar files you may have written for the earlier version very probably won’t work in Antlr 4. However, the new syntax is much closer to the more standard ways of representing these rules in other systems like lex and yacc (if you’ve never heard of these, don’t worry; we won’t be using them).

Here’s the grammar file as modified for our calculator:

grammar Calculator;

@parser::members
{
	protected const int EOF = Eof;
}

@lexer::members
{
	protected const int EOF = Eof;
	protected const int HIDDEN = Hidden;
}

/*
 * Parser Rules
 */

prog: expr+ ;

expr : expr op=('*'|'/') expr	# MulDiv
	 | expr op=('+'|'-') expr	# AddSub
	 | INT					# int
	 | '(' expr ')'			# parens
	 ;

/*
 * Lexer Rules
 */
INT : [0-9]+;
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
WS
	:	(' ' | '\r' | '\n') -> channel(HIDDEN)
	;

First, look at the lexer rules. The INT token is defined using the usual regular expression for one or more digits. The four arithmetic operators are given token names (MUL, DIV, ADD and SUB) that we’ll use later. Finally, we’ve modified the WS (whitespace) rule so it includes blanks, returns and newlines (but not tabs, so add '\t' to the rule if you need them). The ‘-> channel(HIDDEN)’ sends whitespace tokens to a hidden channel, so the parser never sees them and whitespace is effectively ignored.
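To make the lexer’s job a bit more concrete, here is a rough hand-rolled imitation of what those rules do to an input string. This is not the code Antlr generates; it’s only a sketch using .NET’s Regex class, and the names LexerSketch, Tokenize, LPAREN and RPAREN are made up for the illustration (the grammar gives the parentheses no token names of their own).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LexerSketch
{
    // One named group per lexer rule in the grammar. LPAREN and RPAREN are
    // hypothetical names; the grammar only uses '(' and ')' as literals.
    static readonly Regex TokenPattern = new Regex(
        @"(?<INT>[0-9]+)|(?<MUL>\*)|(?<DIV>/)|(?<ADD>\+)|(?<SUB>-)" +
        @"|(?<LPAREN>\()|(?<RPAREN>\))|(?<WS>[ \r\n]+)");

    static readonly string[] Names =
        { "INT", "MUL", "DIV", "ADD", "SUB", "LPAREN", "RPAREN", "WS" };

    // Splits the text into "NAME:text" strings, one per token.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            Match m = TokenPattern.Match(text, pos);
            // The next token must start exactly at pos; anything else is a bad character.
            if (!m.Success || m.Index != pos)
                throw new FormatException("Unexpected character at position " + pos);
            foreach (string name in Names)
            {
                if (m.Groups[name].Success)
                {
                    // WS is simply dropped here. Antlr instead keeps it on a
                    // hidden channel that the parser does not read.
                    if (name != "WS") tokens.Add(name + ":" + m.Value);
                    break;
                }
            }
            pos += m.Length;
        }
        return tokens;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Tokenize("12 + 3*(4 - 1)")));
    }
}
```

Run on ‘12 + 3*(4 - 1)’, this prints INT:12 ADD:+ INT:3 MUL:* LPAREN:( INT:4 SUB:- INT:1 RPAREN:), with the blanks gone. The token names are the same ones that turn up later as constants when we check which operator was parsed.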

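Now look at the parser rules. The first rule is ‘prog’, which is defined as one or more ‘expr’s. An ‘expr’ is defined as a single INT, or a single expr enclosed in parentheses, or two exprs separated by one of the four arithmetic operators. The order of the alternatives matters: Antlr 4 gives earlier alternatives higher precedence, so listing the ‘*’ and ‘/’ alternative before the ‘+’ and ‘-’ one makes multiplication and division bind more tightly than addition and subtraction.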

Note that each line in the expr declaration has a label preceded by a # symbol at the end. These are not comments; rather they are tags that are used in writing the code that tells the parser what to do when each of these expressions is found.

This is where the biggest difference between Antlr 3 and Antlr 4 occurs: in Antlr 4, the rules in the grammar file contain no code in the target language (C# here). All such code is moved elsewhere in the program. This makes the grammar rules language-independent, so once you’ve written them you could drop them into another project and use them to support a Java application, or an application in any other language for which an Antlr 4 target exists. (The @parser::members and @lexer::members blocks inserted by the VS template are C#, so those would need removing or adapting first.)

Using the grammar in a C# program

So how exactly do you use the grammar to interpret an input string? Before we can use the lexer and parser, we need to write the code that should be run when each type of expression is parsed from the input. To do this, we need to write a C# class called a visitor. Antlr 4 generates a base class for your visitor, named CalculatorBaseVisitor<Result>. It is a generic class, and ‘Result’ is the data type that is returned by the various methods inside the class. Your job is to override some or all of the methods in the base class so that they run the code you want in response to each bit of parsed input.

In our case, we want the calculator to return an int for each expression it calculates, so create a new class called (say) CalculatorVisitor and make it inherit CalculatorBaseVisitor<int>. Your skeleton class looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Calculator
{
  class CalculatorVisitor : CalculatorBaseVisitor<int>
  {
  }
}

To see what methods you need to override, it’s easiest to use VS’s Intellisense. Within the class type the keywords ‘public override’ after which Intellisense should pop up with a list of methods you can override. Look at the methods that start with ‘Visit’. Among these you should find a method for each label you assigned in the ‘expr’ definition in the grammar file above. Thus you should have VisitInt, VisitParens, VisitMulDiv and VisitAddSub. These are the methods you need to override. The other Visit methods you can ignore, as the versions provided in the base class will work fine.

Here’s the complete class. We’ll discuss the code in a minute:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Calculator
{
  class CalculatorVisitor : CalculatorBaseVisitor<int>
  {
    public override int VisitInt(CalculatorParser.IntContext context)
    {
      return int.Parse(context.INT().GetText());
    }

    public override int VisitAddSub(CalculatorParser.AddSubContext context)
    {
      int left = Visit(context.expr(0));
      int right = Visit(context.expr(1));
      if (context.op.Type == CalculatorParser.ADD)
      {
        return left + right;
      }
      else
      {
        return left - right;
      }
    }

    public override int VisitMulDiv(CalculatorParser.MulDivContext context)
    {
      int left = Visit(context.expr(0));
      int right = Visit(context.expr(1));
      if (context.op.Type == CalculatorParser.MUL)
      {
        return left * right;
      }
      else
      {
        return left / right;
      }
    }

    public override int VisitParens(CalculatorParser.ParensContext context)
    {
      return Visit(context.expr());
    }
  }
}

First, notice that the argument of each method (called ‘context’) is different in each case, and corresponds to the tag used to define the method. Each ‘context’ object contains information required to evaluate it, and this content can be determined by how you defined the various expr lines in the grammar file above. (BTW, you may need to rebuild the project to get VS’s Intellisense to work here.)

Take the VisitInt() method first. This is called when the object being parsed is a single integer, so what you want to return is the value of this integer. We can get this by calling the INT() method of the context object, and then GetText() from that. This returns the integer as a string, so we need to use int.Parse to convert it to an int.

Now look at the VisitParens() method at the bottom. Here, the contents of the parentheses could be an expr of arbitrary complexity, but we want to return whatever that expr evaluates to as the result. This is what the inherited Visit() method does: it takes an expr as an argument and calls the correct method depending on the type of the expr. The expr() method being called from the ‘context’ returns this expr (which will have a tag attached to it to say whether it’s an Int, or an AddSub, or whatever) and Visit will call the correct method to evaluate this expr.

Finally, VisitAddSub and VisitMulDiv work pretty much the same way. Both of these represent binary operators, so there are two subsidiary exprs to evaluate before the operator is applied. In each case, we evaluate the left and right operands by calling Visit(context.expr(0)) and Visit(context.expr(1)) respectively. Then we check which operator is in the ‘context’. Note that in the grammar file above we attached a label called ‘op’ to the operator token. Also note that the names we gave to the four arithmetic operators in the lexer rules show up as constants within the CalculatorParser class, so we can compare the Type of ‘op’ with these constants to find out which operator we’re dealing with. Once we know that, we can return the correct calculation. (Since everything is an int, ‘/’ here is integer division.)
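If the way Visit() finds the right method still seems a bit magical, the following stripped-down sketch may help. It doesn’t use Antlr at all: Node, IntNode, AddSubNode, ParensNode and MiniVisitor are hypothetical stand-ins I’ve made up to mimic the shape of the generated context classes and visitor. The idea is that each node type has an Accept() method that calls back into the matching Visit method, and Visit() simply hands control to the node.

```csharp
using System;

// Hypothetical stand-in for a generated context class.
abstract class Node
{
    public abstract int Accept(MiniVisitor visitor);
}

class IntNode : Node
{
    public string Text;
    public IntNode(string text) { Text = text; }
    // Each node type knows which Visit method belongs to it.
    public override int Accept(MiniVisitor visitor) { return visitor.VisitInt(this); }
}

class AddSubNode : Node
{
    public Node Left, Right;
    public char Op;
    public AddSubNode(Node left, char op, Node right) { Left = left; Op = op; Right = right; }
    public override int Accept(MiniVisitor visitor) { return visitor.VisitAddSub(this); }
}

class ParensNode : Node
{
    public Node Inner;
    public ParensNode(Node inner) { Inner = inner; }
    public override int Accept(MiniVisitor visitor) { return visitor.VisitParens(this); }
}

// Hypothetical stand-in for CalculatorVisitor.
class MiniVisitor
{
    // Visit() doesn't inspect the node type itself; the node picks the method.
    public int Visit(Node node) { return node.Accept(this); }

    public int VisitInt(IntNode node) { return int.Parse(node.Text); }

    public int VisitAddSub(AddSubNode node)
    {
        int left = Visit(node.Left);
        int right = Visit(node.Right);
        return node.Op == '+' ? left + right : left - right;
    }

    public int VisitParens(ParensNode node) { return Visit(node.Inner); }
}

static class Demo
{
    static void Main()
    {
        // Hand-built tree for (1 + 2) - 4.
        Node tree = new AddSubNode(
            new ParensNode(new AddSubNode(new IntNode("1"), '+', new IntNode("2"))),
            '-',
            new IntNode("4"));
        Console.WriteLine(new MiniVisitor().Visit(tree));   // prints -1
    }
}
```

In the real program you never build the tree by hand, of course; the parser does that, and the generated context classes play the role of the Node subclasses here.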

At long last, we’re ready to look at the code that uses all this stuff. Here’s the Main() function:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using Antlr4.Runtime;
using Antlr4.Runtime.Misc;
using Antlr4.Runtime.Tree;

namespace Calculator
{
  class Program
  {
    static void Main(string[] args)
    {
      StreamReader inputStream = new StreamReader(Console.OpenStandardInput());
      AntlrInputStream input = new AntlrInputStream(inputStream.ReadToEnd());
      CalculatorLexer lexer = new CalculatorLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      CalculatorParser parser = new CalculatorParser(tokens);
      IParseTree tree = parser.prog();
      Console.WriteLine(tree.ToStringTree(parser));
      CalculatorVisitor visitor = new CalculatorVisitor();
      Console.WriteLine(visitor.Visit(tree));
    }
  }
}

Note the ‘using’ statements at the top, required to access Antlr4.

We create a StreamReader from the console in the first statement of Main() so we can type in our expressions. The second statement then passes the string read from the stream (via inputStream.ReadToEnd()) to an AntlrInputStream.

Important note: AntlrInputStream is supposed to accept a raw Stream object for its input, but I couldn’t get this to work. The program just hung when attempting to read from the Stream directly. It seems this is a bug in Antlr 4, so it may be fixed in a future release.

The lexer is then created from the AntlrInputStream, and the tokens it produces are wrapped in a CommonTokenStream, which is passed to the parser. The parser is run by calling parser.prog(), which invokes the ‘prog’ rule (the top-level rule defined in the grammar file above). The resulting parse tree is saved as an IParseTree, printed with ToStringTree() so we can see its structure, and then passed to a CalculatorVisitor to do the evaluation. Finally, the result of the evaluation is printed.

To run the program, type one or more expressions in the console (you can separate them by newlines or blanks). When you’re done, type control-Z on a line by itself and the output from the program should then be displayed.
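If you’d rather not wait for control-Z, one alternative is to read and evaluate one line at a time. Here’s a minimal sketch of such a loop. To keep it runnable without Antlr, the evaluation step is passed in as a delegate: ReadEvalLoop, Run and the ‘evaluate’ parameter are names I’ve made up, and in the calculator the delegate would wrap the lexer, parser and visitor steps from Main() above. The stand-in used in this demo only understands a bare integer.

```csharp
using System;
using System.IO;

static class ReadEvalLoop
{
    // Reads one expression per line until end of input and writes each result.
    // 'evaluate' is a hypothetical hook for whatever turns a line into an int.
    // Returns the number of lines evaluated.
    public static int Run(TextReader reader, TextWriter writer, Func<string, int> evaluate)
    {
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)    // null signals end of input
        {
            if (line.Trim().Length == 0) continue;    // skip blank lines
            writer.WriteLine(evaluate(line));
            count++;
        }
        return count;
    }

    static void Main()
    {
        // Stand-in evaluator so the sketch runs on its own; it is NOT the calculator.
        Func<string, int> standIn = text => int.Parse(text.Trim());
        Run(new StringReader("12\n\n 7 \n"), Console.Out, standIn);   // prints 12, then 7
    }
}
```

To use this with the console, you would pass Console.In and Console.Out instead of the StringReader. Each line then gets its own parse tree and its own result.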


Comments

  • Anonymous  On February 4, 2014 at 10:00 AM

    Great post.
    However compiling the solution I get a ‘Unknown build error: Object reference not set to an instance of an object’.
    What’s wrong?
    Thanks

  • Memin  On May 31, 2014 at 5:56 PM

    Thank you, quite a comprehensive tutorial. It helped me a lot.

  • Tarek701  On July 28, 2014 at 7:29 PM

    Hey, growescience. Could you please explain me where to find that “BaseVisitor” etc. stuff? It didn’t generate for me. I just got Parser And Lexer.cs

    • growescience  On August 3, 2014 at 4:52 PM

      Sorry for delay – haven’t checked this blog for a while.
      It seems that the process I described works if you create an Empty C# Project, but not if you create a WPF project.
      In that case, the BaseVisitor file is in the ProjectDirectory\obj\Debug.

      • growescience  On August 3, 2014 at 5:01 PM

        Also works if you create a C# console application. I’m not sure how you integrate Antlr4 into WPF or Windows Forms.

  • Everything Android  On August 12, 2014 at 6:24 PM

    Your method of explaining everything in this post is really pleasant, everyone can effortlessly understand it, Thanks a lot.

  • Anonymous  On November 22, 2014 at 9:10 AM

    A well written post which gives you a good introduction into using ANTLR4 with C#.

  • Fevzi  On January 16, 2015 at 3:25 PM

    use inputStream.ReadLine() not inputStream.ReadToEnd() .

  • raptorvsrex  On October 12, 2015 at 10:48 PM

    Thanks for this, it’s very helpful!

    I used your example with mono on linux and antlr-4.5.1, all from the command line. It worked with some minor modifications. There seems to be a bug in ANTLR that causes ICalculatorVisitor.cs to instead be named CalculatorVisitor.cs. Here’s what I had to do to work around that, starting with your files in place (CalculatorVisitor.cs, Program.cs, Calculator.g4):

    mv CalculatorVisitor.cs CalculatorVisitor.cs.save
    java -jar antlr-4.5.1-complete.jar -visitor -Dlanguage=CSharp Calculator.g4
    mv CalculatorVisitor.cs ICalculatorVisitor.cs
    mv CalculatorVisitor.cs.save CalculatorVisitor.cs
    mcs -out:calculator.exe -r:Antlr4.Runtime.dll Calc*.cs Program.cs ICalculatorVisitor.cs

    Presumably at some point the ANTLR folks will fix this, and then this should work as written but without the three mv commands. This should also work from the Windows command line (if you have mono installed), but you’ll need to use rename instead of mv.
