Expressing yourself in Java: Regular Expressions

They can be rather opaque, but also extremely powerful

Regular expressions are an area of computing that most of us know a little about and roughly understand, but tend to avoid using except when absolutely necessary. In the past, they often required the use of tools such as Perl, rather than languages such as Java.

However, for some time now (in fact since Java version 1.4) there has been a regular expression package within the standard Java libraries: java.util.regex. This package contains two main classes, Pattern and Matcher, which together provide much of the pattern matching power of tools such as Perl.

In this column we will look at what a regular expression is, why it is useful and how the regular expression package can be used from within Java. As a concrete example, we will look at a simple Java class that can be used to verify the format of VAT numbers from a number of European countries.

Regular expressions

A regular expression (often referred to as a regex) is a pattern that describes the structure of a string of text. It can be used to determine whether an arbitrary string matches the defined pattern. For example, let's assume (it's an oversimplification) that a UK postcode follows a pattern that can be described as:

  • Two letters
  • Followed by 1 or 2 digits
  • Followed by a space
  • Followed by 1 or 2 digits
  • And finally by two letters

Within our programs, a regular expression is exactly like this, except that it is presented in a rather more concise (and more powerful) form.
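As a minimal sketch, the simplified postcode description above could be written as a single pattern using the java.util.regex classes discussed in the rest of this column. The names PostcodeSketch and looksLikePostcode are illustrative only, and the pattern encodes just the oversimplified rules listed above, not the real UK postcode format.

```java
import java.util.regex.Pattern;

final class PostcodeSketch {

    // Two letters, 1 or 2 digits, a space, 1 or 2 digits, two letters.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern SIMPLIFIED_POSTCODE =
            Pattern.compile("[A-Za-z]{2}\\d{1,2} \\d{1,2}[A-Za-z]{2}");

    static boolean looksLikePostcode(String text) {
        // matches() requires the whole string to fit the pattern
        return SIMPLIFIED_POSTCODE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePostcode("AB12 3CD")); // true
        System.out.println(looksLikePostcode("AB123CD"));  // false: no space
    }
}
```

Five lines of description collapse into one short string, which is where both the conciseness and the opacity of regular expressions come from.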

Literal patterns

A very simple regular expression might be the pattern 'John'. This regular expression could be used to determine whether other strings contain the text 'John'. In Java, this is done by compiling the regular expression into a Pattern object. This Pattern object is then used to create a Matcher object, which is derived from the Pattern object and the string to be processed.

For example, the following program allows any string to be searched for the pattern "John". The string to search is taken from the first command-line argument. Line 12 defines the pattern, line 13 creates the Pattern object and line 14 creates the Matcher object. As can be seen from the imports, the Pattern and Matcher classes are defined in the java.util.regex package. The Pattern class acts as a factory for creating Pattern objects. In turn, a Pattern object acts as a factory for creating Matcher objects. The resulting Matcher object is then used in line 19 to determine whether the string in args[0] contains the pattern 'John' or not.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleVerifier {

    /**
     * @param args
     */
    public static void main(String[] args) {
        String format = "John";                      // line 12
        Pattern pattern = Pattern.compile(format);   // line 13
        Matcher matcher = pattern.matcher(args[0]);  // line 14
        System.out.println(format +
                           " is contained in the String >" +
                           args[0] +
                           "< -> " +
                           matcher.find());          // line 19
    }
}

Regular expression patterns

The previous example, while functionally complete, is not the most exciting regular expression that can be imagined. Let us consider instead a VAT number. Let us start with the structure of a VAT number in France. It is:

1 block of two characters, a space and then 1 block of 9 digits

We could, of course, write a program that checks each character in a String to determine whether it matches this rule. However, with regular expressions we can do this in a far more elegant way. We can define a regular expression pattern that contains character sets, repetition and shorthand character classes. For example, we want to say that the VAT number starts with a block of two characters. This can be done using the pattern:

[A-Za-z]{2}

This states that any character A to Z (upper case or lower case) must appear twice in succession. If this pattern were used to create a matcher object, then the resulting matcher would return the following results:

AB -> true
A -> false

Such patterns are so common that there are shorthand forms for frequently used character sets. For example, "\w" matches a "word character", which is any of the letters A-Za-z plus the digits 0-9 and the underscore, so it is broader than [A-Za-z]. There are also equivalents for digits ("\d") and for whitespace ("\s"). Thus if we wish to write a pattern for two word characters, a whitespace character and two digits we could create the pattern:

\w{2}\s\d{2}

This is exactly what has been done in the following FormatTester program. Note that because "\" has a special meaning within Java strings, it must be escaped by another "\". This does make the pattern less readable, but maintains compatibility with existing regular expression formats.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormatTester {

    public static void main(String[] args) {
        String format = "\\w{2}\\s\\d{2}";
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher("AB 12");
        System.out.println(matcher.find());   // line 12
        matcher = pattern.matcher("ABC1A1");
        System.out.println(matcher.find());   // line 14
    }

}

The statement at line 12 prints true, while the statement at line 14 prints false, because "ABC1A1" contains no whitespace character.
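To see how the shorthand classes differ from an explicit character set, the following sketch runs a few patterns through find(). The class name ShorthandSketch, the helper containsMatch and the sample inputs are all illustrative and not part of the original programs.

```java
import java.util.regex.Pattern;

final class ShorthandSketch {

    // True if the regex matches anywhere within the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("[A-Za-z]{2}", "AB")); // true
        System.out.println(containsMatch("[A-Za-z]{2}", "A"));  // false
        System.out.println(containsMatch("[A-Za-z]{2}", "42")); // false
        // \w is broader than [A-Za-z]: it also accepts digits and underscore
        System.out.println(containsMatch("\\w{2}", "42"));      // true
        System.out.println(containsMatch("\\w{2}", "a_"));      // true
        // \s accepts any whitespace character, not only a space
        System.out.println(containsMatch("\\w{2}\\s\\d{2}", "AB\t12")); // true
    }
}
```

If only letters should be accepted, the explicit set [A-Za-z] is the safer choice.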

Going further with Regular Expression Patterns

There is a great deal that you can do with regular expressions. For example, you can indicate that a character is optional (using a ?). You can anchor parts of a pattern to a certain position. You can even take into account what is around a particular character. For example, q(?=u) matches the q in question, but not in qestion (note the missing u). You can also use alternation. Alternation is the regular expression equivalent of "or". For example, you can write "Jack|Jill" for Jack or Jill.
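A short sketch of these features in Java might look like the following. The class name FeatureSketch, the helper found and the patterns "colou?r" and "^John" are illustrative choices that do not appear elsewhere in this column; q(?=u) and Jack|Jill are the examples from the text.

```java
import java.util.regex.Pattern;

final class FeatureSketch {

    // True if the regex matches anywhere within the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Optional character: the "u" may or may not be present
        System.out.println(found("colou?r", "color"));    // true
        System.out.println(found("colou?r", "colour"));   // true

        // Anchor: ^ ties the match to the start of the input
        System.out.println(found("^John", "John Smith")); // true
        System.out.println(found("^John", "Dear John"));  // false

        // Lookahead: q only matches when followed by u
        System.out.println(found("q(?=u)", "question"));  // true
        System.out.println(found("q(?=u)", "qestion"));   // false

        // Alternation: either name is accepted
        System.out.println(found("Jack|Jill", "Jill went up the hill")); // true
    }
}
```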

To really exploit the power of regular expressions you should start by looking at some of the regular expressions material listed at the end of this column.

A practical example

To illustrate how regular expressions might be used in a larger program we will look at a simple European VAT number verifier. The aim of this program is to verify whether a given string is a valid European VAT number, given a pattern for a particular country's VAT code. The program is presented below.

The first part of the program defines the patterns that will be used to determine whether a particular VAT number conforms to a particular country's VAT number format. Note that the formats presented are illustrative and intended to show various aspects of regular expressions within Java. They should not be treated as a full or complete definition of a set of regular expressions for VAT number processing.

The patterns defined display various aspects of the pattern language we have looked at above. For example, COUNTRY_CODE_EXP matches any two word characters, while the GERMANY VAT number pattern matches a sequence of nine digits. In turn, the FRANCE pattern is a combination of two word characters, a whitespace character and nine digits. The UK pattern incorporates these features and adds alternation ("or"), using "|" to indicate that the pattern can be either a sequence of digits and spaces, or two letters followed by three digits, where the two letters must be either GD or HA.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VatNumberVerifier {
    /** If a VAT number starts with a country code we will
     * remove it */
    private static final String COUNTRY_CODE_EXP = "\\w{2}";  
    /** 1 block of 9 digits */
    private static final String GERMANY = "\\d{9}";
    /**
     * 1 block of two characters, a space and then
     * 1 block of 9 digits
     */
    private static final String FRANCE = "\\w{2}\\s\\d{9}";
    /**
     * 999 9999 99 or
     * GD999 or
     * HA999.
     * 
     * GD Identifies Government Departments.
     * HA Identifies Health Authorities.
     */
    private static final String UK = 
       "\\d{3}\\s\\d{4}\\s\\d{2}|(GD|HA)\\d{3}";

Following on from this we define a single instance variable to hold the compiled country code pattern and initialise it within the constructor:

    private Pattern countryCodePattern = null;
    
    VatNumberVerifier() {
        countryCodePattern = Pattern.compile(COUNTRY_CODE_EXP);
    }

Next we define the method to strip off the country code (if present). This method uses the pre-compiled countryCodePattern Pattern object to determine whether the number presented to it begins with two word characters, which are taken to be a country code. If it does, then the first two characters of the VAT number are removed and the resulting string is returned. If the VAT number does not begin with a country code then the string is returned as is.

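    private String stripCountryCodes(String vatNumber) {
        String result = vatNumber;
        Matcher matcher = countryCodePattern.matcher(vatNumber);
        // lookingAt() only matches at the start of the input,
        // whereas find() would accept two word characters anywhere
        if (matcher.lookingAt()) {
            result = result.substring(2);
        }
        return result;
    }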
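The choice of Matcher method matters for a "starts with" check like this one: find() looks for the pattern anywhere in the input, whereas lookingAt() only tries to match at the beginning. The following sketch contrasts the two using the same \w{2} expression. The class name PrefixSketch and its two helper methods are illustrative only.

```java
import java.util.regex.Pattern;

final class PrefixSketch {

    private static final Pattern TWO_WORD_CHARS = Pattern.compile("\\w{2}");

    // True if two word characters occur anywhere in the input.
    static boolean foundAnywhere(String input) {
        return TWO_WORD_CHARS.matcher(input).find();
    }

    // True only if the input begins with two word characters.
    static boolean foundAtStart(String input) {
        return TWO_WORD_CHARS.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(foundAnywhere(" 728 7030 32")); // true: "72" is found
        System.out.println(foundAtStart(" 728 7030 32"));  // false: starts with a space
        System.out.println(foundAtStart("UK 728 7030 32")); // true
        // \w also matches digits, so a number with no country code
        // still looks as if it starts with one
        System.out.println(foundAtStart("728 7030 32"));   // true
    }
}
```

As the last line shows, \w{2} cannot tell a country code from the first two digits of a number, so a production verifier would probably need a stricter expression such as an explicit list of country codes.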

Finally, we can define the method that will actually perform the verification of the VAT numbers. This method returns true or false depending upon whether the VAT number, with any country code removed, contains a match for the pattern passed into it.

    public boolean verify(String format, String vatNumber) {
        vatNumber = stripCountryCodes(vatNumber);
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher(vatNumber);
        return matcher.find();
    }

This method can now be used by a test harness to verify some VAT numbers. For example:

public static void main(String [] args) {
    VatNumberVerifier verifier = new VatNumberVerifier();
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK 728 7030 32"));
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK HA787"));
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK AB56GGF"));
    System.out.println("FRANCE Verify: " +
                       verifier.verify(FRANCE, "FR AA 363478400"));
    System.out.println("GERMANY Verify: " +
                       verifier.verify(GERMANY, "DE 213709651"));
    System.out.println("GERMANY Verify: " +
                       verifier.verify(GERMANY, "DE 136665975"));
}

The result of running this test harness is:

UK Verify: true
UK Verify: true
UK Verify: false
FRANCE Verify: true
GERMANY Verify: true
GERMANY Verify: true

Conclusions

Regular expressions are extremely powerful, very useful and available directly from within Java. However, they need to be used with care, both because of the potential complexity of the patterns that can be defined and because of the ease with which an error can be introduced. For example, when used with find(), the pattern \\d{2}\\s\\d{2} will correctly match the string "12 12" but will also match the string "12 123" unless it is delimited (for example with the anchors ^ and $) to strictly match only the former. In addition, due to the need to include additional "\" characters, the regular expressions can become almost unreadable. However, used wisely they offer huge benefits for processing text for the extraction of data, information, knowledge and analysis.
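The "12 12" versus "12 123" pitfall can be reproduced with a small sketch. The class name WholeMatchSketch and its helper methods are illustrative. It shows two common ways of insisting on a strict match: calling matches() instead of find(), or anchoring the pattern with ^ and $.

```java
import java.util.regex.Pattern;

final class WholeMatchSketch {

    private static final Pattern TWO_AND_TWO =
            Pattern.compile("\\d{2}\\s\\d{2}");
    private static final Pattern ANCHORED =
            Pattern.compile("^\\d{2}\\s\\d{2}$");

    // find(): true if the pattern occurs anywhere in the input
    static boolean contains(String input) {
        return TWO_AND_TWO.matcher(input).find();
    }

    // matches(): true only if the entire input fits the pattern
    static boolean matchesExactly(String input) {
        return TWO_AND_TWO.matcher(input).matches();
    }

    // find() with an anchored pattern
    static boolean containsAnchored(String input) {
        return ANCHORED.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("12 12"));          // true
        System.out.println(contains("12 123"));         // true, probably unintended
        System.out.println(matchesExactly("12 12"));    // true
        System.out.println(matchesExactly("12 123"));   // false
        System.out.println(containsAnchored("12 123")); // false
    }
}
```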

Further Reading

Mastering Regular Expressions, Friedl, Jeffrey E. F., O'Reilly, Cambridge, MA, ISBN 0596002890, 2002.

Programming Perl, Larry Wall, Tom Christiansen & Jon Orwant, O'Reilly, Cambridge, MA, ISBN 0596000278, 2000.

An Online Regular Expression Tutorial can be found here.

The Java Tutorial: Regular Expressions can be found here.
