JavaCC: Don't talk back

Implementing a parser-analyser

With the advent of XML, when data must be exchanged or information read, a simple solution is often to mark the document up using XML and then parse it using an XML parser.

However, in some situations the documents to be processed may not be in XML format. This could be because of legacy systems, external constraints or because they adhere to other standards. In such situations you still need to parse the documents to ensure that they conform to whatever standard is expected, to analyse their contents to determine the information they provide, and to take some action based on that information. To do this you may implement your own parser-analyser, or you could choose to look at an existing parser generator tool. Probably the best established parser generator tool for Java is JavaCC (Java Compiler Compiler).

JavaCC is a parser generator in the best tradition of tools such as YACC (Yet Another Compiler Compiler). JavaCC started out life as Jack and was developed by Sun Microsystems. It was then passed to a company called Metamata and renamed JavaCC. It is now an open source project and is very widely used.

JavaCC offers the Java developer a tool that processes a grammar specification and produces a set of Java classes that can read and analyse input that matches that grammar. JavaCC also provides additional tools that can be used with the main JavaCC tool, such as JJTree, a preprocessor that adds parse-tree-building actions to a grammar. The grammar to be analysed is defined in a BNF-like notation that is quick to learn and allows Java to be embedded within it (to allow callbacks to your own Java code).

Obtaining JavaCC

You can download JavaCC from the official JavaCC website. From here you can download the latest version of the JavaCC classes (or indeed the source if you so wish). What you get when you download the JavaCC binary distribution is:

  • The javacc.jar
  • Some command line tools (such as JJTree)
  • Javadoc Documentation
  • A number of examples

You can also find sample JavaCC Grammars here.

A useful textbook is A.V. Aho, R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1986. You can buy it at Cash 'n' Carrion.

A Sample Grammar

A JavaCC grammar file can be divided into a number of parts. These parts are:

  • An optional list of options
  • A Java compilation unit
  • The lexical specification
  • A list of grammar productions

Each of these is described in a little more detail below:

A List of Options

The developer can influence the way in which the grammar is processed using these options (which may also be specified on the command line if the JavaCC command is being used). In the following example, we declare the LOOKAHEAD option, which specifies the number of tokens to look ahead before making a decision at a choice point during parsing:

options {
   LOOKAHEAD=2;
}

Other options include STATIC (which indicates whether the generated parser's methods should be static), DEBUG_PARSER (which includes debugging output in the generated parser) and IGNORE_CASE (which indicates whether case should be ignored when matching tokens).
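
To see why a lookahead of more than one token can matter at a choice point, consider two alternatives that both start with the same token. The following is a minimal hand-written Java sketch, not code that JavaCC generates; the class name LookaheadSketch and the token names ID, = and ( are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: JavaCC produces its own lookahead logic.
class LookaheadSketch {

    // Picks an alternative by peeking at the next two tokens without consuming them.
    // Both alternatives begin with ID, so one token of lookahead cannot tell them apart.
    static String choose(List<String> tokens) {
        String first = tokens.size() > 0 ? tokens.get(0) : "<EOF>";
        String second = tokens.size() > 1 ? tokens.get(1) : "<EOF>";
        if (first.equals("ID") && second.equals("=")) {
            return "assignment";   // ID = ...
        }
        if (first.equals("ID") && second.equals("(")) {
            return "call";         // ID ( ...
        }
        return "error";
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList("ID", "=", "NUMBER"))); // prints "assignment"
        System.out.println(choose(Arrays.asList("ID", "(", ")")));      // prints "call"
    }
}
```

With only the first token available, both inputs look identical; the second token is what resolves the choice, which is the situation a LOOKAHEAD value of 2 is meant to cover.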

Java Compilation Unit

The Java compilation unit is enclosed between PARSER_BEGIN(name) and PARSER_END(name). The name that follows PARSER_BEGIN and PARSER_END must be the same, and it becomes the name of the generated parser class. This section allows you to provide definitions to be used with the generated Java parser class. For example, in the following code snippet we import the java.io package and define a main method. This allows the resulting SimpleCalculator parser class to be used standalone. The main method reads a data file called “test.dat” which is then parsed by the generated SimpleCalculator parser class.

PARSER_BEGIN(SimpleCalculator)
import java.io.*;

public class SimpleCalculator {
   public static void main(String [] args) throws Exception {
      File file = new File("test.dat");
      System.out.println("Reading: " + file.getAbsolutePath());
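      // Reuse the File object so the path printed above is the one that is read
      FileReader reader = new FileReader(file);
      SimpleCalculator sc = new SimpleCalculator(reader);
      // calc() is a grammar production; this loop stops only when calc()
      // throws an exception or exits the JVM
      while (true) {
         sc.calc();
      }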
   }
}

PARSER_END(SimpleCalculator)
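
The driving pattern in that main method (wrap a Reader, then call a production repeatedly) can be tried out without running JavaCC. The sketch below uses a hand-written stand-in rather than a generated parser: the class StandInCalculator, the behaviour of its calc() method (summing the numbers on one line) and the run() helper are all assumptions for illustration, and the real calc() production is defined by the grammar. Unlike the loop above, this one stops cleanly at end of input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a generated parser class; JavaCC's real output differs.
class StandInCalculator {
    private final BufferedReader in;

    StandInCalculator(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    // Processes one line of space-separated numbers and returns their sum,
    // or null once the input is exhausted.
    Integer calc() {
        try {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            int sum = 0;
            for (String part : line.trim().split("\\s+")) {
                if (!part.isEmpty()) {
                    sum += Integer.parseInt(part);
                }
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Calls calc() until the input runs out, closes the reader,
    // and returns the number of lines processed.
    static int run(Reader source) {
        int lines = 0;
        try (Reader r = source) {
            StandInCalculator sc = new StandInCalculator(r);
            Integer result;
            while ((result = sc.calc()) != null) {
                System.out.println(result);
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        run(new StringReader("1 2 3\n40 2\n")); // prints 6, then 42
    }
}
```

Passing a StringReader instead of a FileReader makes the driver easy to exercise in a test, since the constructor only needs a Reader.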

The Lexical Specification

Next we have our lexical specification. This defines the tokens to be recognised when parsing data input. For example:

SKIP:
{
   " "
|   "\r"
}

TOKEN:
{
   <NUMBER: (<DIGIT>)+ >
|  <DIGIT: ["0"-"9"] >
|  <EOL: "\n" >
}

This specifies that spaces and carriage returns should be skipped. It then defines the set of tokens to be understood by the parser. In this case a NUMBER is defined as one or more DIGITs, where a DIGIT is a character in the range 0 to 9. Finally the EOL (or End Of Line) token is defined to be a newline (“\n”).
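
To make the effect of these rules concrete, here is a small hand-written Java sketch of a tokenizer that follows the same SKIP and TOKEN rules. It is not the code JavaCC generates; the class name LexerSketch and the string form of the tokens are assumptions for illustration, and NUMBER is assumed to mean a run of one or more digits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the SKIP and TOKEN rules; not JavaCC's generated lexer.
class LexerSketch {

    // Returns token descriptions such as "NUMBER(12)" or "EOL".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\r') {
                i++;                                  // SKIP: spaces and carriage returns
            } else if (c == '\n') {
                tokens.add("EOL");                    // EOL token
                i++;
            } else if (c >= '0' && c <= '9') {
                int start = i;                        // NUMBER: longest run of digits
                while (i < input.length()
                        && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("NUMBER(" + input.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at position " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12 7\r\n")); // prints [NUMBER(12), NUMBER(7), EOL]
    }
}
```

Any character not covered by a rule is rejected, which mirrors the way a lexical specification only accepts input that matches one of its declared tokens.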
