Feeds

Expressing yourself in Java: Regular Expressions

They can be rather opaque, but also extremely powerful

Internet Security Threat Report 2014

Regular expressions are an area of computing that most of us know a little about, have a rough understanding of, but have often avoided using except when absolutely necessary. In the past, they often required the use of tools such as Perl, rather than languages such as Java.

However, for some time now, in fact since Java version 1.4, there has been a regular expression package within the stand Java libraries. That is, the Java package, java.util.regex. This package contains two classes, which together provide much of the pattern matching power of tools such as Perl.

In this column we will look at what a regular expression is, why it is useful and how the regular expression package can be used from within Java. As a concrete example, we will look at a simpler Java class that can be used to verify the format of VAT numbers form a number of European countries.

Regular expressions

A regular expression (often referred to as a regex) is a pattern describing a certain amount of text. This pattern describes the structure of a string of text. It can be used to determine if an arbitrary string matches the defined pattern. For example, let's assume (it's an oversimplification) that a UK postcode follows a pattern that can be described as:

  • Two letters
  • Followed by 1 or 2 numbers
  • Followed by a space
  • Followed by 1 or 2 numbers
  • And finally by two letters

Within our programs, a regular expression is exactly like this, except that it is presented in a rather more concise (and more powerful) form.

Literal patterns

A very simple regular expression might be the pattern 'John'. This regular expression could be used to determine whether other strings contained the pattern 'John'. In Java, creating a Pattern object generated from the regular expression to be searched for would do this. This Pattern object is then used to create a Matcher object that is derived form the Pattern object and the string to be processed.

For example, the following program allows any string to be searched for the pattern "John". Line 12 defines the pattern, line 13 creates the Pattern object and line 14 creates the Matcher object. As can be seen from the imports the Pattern and Matcher classes are defined in the java.util.regex package. The Pattern class acts as a factory for creating pattern objects. In turn the Pattern object acts as a factory for creating Matcher objects. The resulting Matcher object is then used in line 19 to determine whether the string in args[0] contains the pattern 'John' or not.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleVerifier {

    /**
     * @param args
     */
    public static void main(String[] args) {
        String format = "John";                      // line 12
        Pattern pattern = Pattern.compile(format);   // line 13
        Matcher matcher = pattern.matcher(args[0]);  // line 14
        System.out.println(format +
                           " is contained in the String >" +
                           args[0] +
                           "< -> " +
                           matcher.find());          // line 19
    }
}

Regular expression patterns

The previous example, while functionally complete, is not the most exciting regular expression that can be imagined. Let us consider instead a VAT number. Let us start with the structure of a VAT number in France. It is:

1 block of two characters, a space and then 1 block of 9 digits

We could, of course, write a program that could check each character in a String to determine whether it matched this rule. However, with regular expressions we can do this in a far more elegant way. We can define a regular expression pattern that contains Character sets, repetition and Short hand character classes. For example, we want to say that the VAT number starts with a block of two characters. This can be done using the pattern:

[A-Za-z]{2}

This states that any character A to Z (upper case or lower case) can appear twice in this pattern. If this pattern were used to create a matcher object, then the resulting matcher would return the following results:

AB -> true
A -> false

Such patterns are so common that there is a short hand form for the letters A-Za-z which is "\w". This short hand form matches a "word character" (alphanumeric characters plus underscore). There are also equivalents for digits ("\d") and for spaces ("\s"). Thus if we wish to write a pattern for two characters, a space and two numbers we could create the pattern:

\w{2}\s\d{2}

This is exactly what has been done in the following FormatTester program. Note that because "\" has a special meaning within Java strings, it must be escaped by another "\". This does make the pattern less readable, but maintains compatibility with existing regular expression formats.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormatTester {

    public static void main(String[] args) {
        String format = "\\w{2}\\s\\d{2}";
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher("AB 12");
        System.out.println(matcher.find());   // line 12
        matcher = pattern.matcher("ABC1A1");
        System.out.println(matcher.find());   // line 14
    }

}

The output at line 12 returns true, while the output at line 14 returns false.

Going further with Regular Expression Patterns

There is a great deal that you can do with regular expressions. For example, you can indicate that a character is optional (using a ?). You can anchor parts of a pattern to a certain position. You can even take into account what is around a particular character. For example, q(?=u) matches the q in question, but not in qestion (note the missing u). You can also use alternation. Alternation is the regular expression equivalent of "or". For example, you can write "Jack|Jill" for Jack or Jill.

To really exploit the power of regular expressions you should start by looking at some of the regular expressions material listed at the end of this column.

A practical example

To illustrate how regular expressions might be used in a larger program we will look at a simple European VAT number verifier. The aim of this program is to verify if a given string is a valid European VAT number given a pattern for a particular countries VAT code. The Program is presented below.

The first part of the program defines the patterns that will be used to determine whether a particular VAT number confirms to a particular countries VAT number format. Note that the formats presented are illustrative and intended to show various aspects of regular expressions within Java and should not be treated as a full or complete definition of a set of regular expressions for VAT number processing.

The patterns defined display various aspects of the pattern language we have looked at above for regular expressions. For example, the COUNTRY_CODE_EXP defines any two characters, while the GERMANY VAT number pattern defines a sequence of nine digits. In turn the FRANCE pattern is a combination of two characters, a space and nine digits. The UK pattern incorporates these features and includes the alternates notation ("or") using the "|" to indicate that the pattern can be either a sequence of digits and spaces, or a pattern of two letters and three numbers where the two letters must either be GD or HA.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VatNumberVerifier {
    /** If a VAT number starts with a country code we will
     * remove it */
    private static final String COUNTRY_CODE_EXP = "\\w{2}";  
    /** 1 block of 9 digits */
    private static final String GERMANY = "\\d{9}";
    /**
     * 1 block of two characters, a space and then
     * 1 block of 9 digits
     */
    private static final String FRANCE = "\\w{2}\\s\\d{9}";
    /**
     * 999 9999 99 or
     * GD999 or
     * HA999.
     * 
     * GD Identifies Government Departments.
     * HA Identifies Health Authorities.
     */
    private static final String UK = 
       "\\d{3}\\s\\d{4}\\s\\d{2}|(GD|HA)\\d{3}";

Following on from this we define a single instance variable to hold the country pattern and initialise it within the constructor:

    private Pattern countryCodePattern = null;
    
    VatNumberVerifier() {
        countryCodePattern = Pattern.compile(COUNTRY_CODE_EXP);
    }

Next we define the method to strip off the country codes (if present). This method uses the pre-configured countryCodePattern Pattern object to determine if the number presented to it starts with a valid county code. If it does, then the first two characters of the VAT number are removed. The resulting string is then returned. If the vat number does not contain a country code then the string is returned as is.

     private String stripCountryCodes(String vatNumber) {
        String result = vatNumber;
        Matcher matcher = countryCodePattern.matcher(vatNumber);
        if (matcher.find()) {
            result = result.substring(2);
        }
        return result;
    }

Finally, we can define the method that will actually perform the verification of the VAT numbers. This method returns true or false depending upon whether the VAT number matches the pattern passed into it.

     public boolean verify(String format, String vatNumber) {
        vatNumber = stripCountryCodes(vatNumber);
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher(vatNumber);
        return matcher.find();
    }

This method can now be used by a test harness to verify some VAT numbers. For example:

public static void main(String [] args) {
    VatNumberVerifier verifier = new VatNumberVerifier();
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK 728 7030 32"));
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK HA787"));
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK AB56GGF"));
    System.out.println("FRANCE Verify: " +
                       verifier.verify(FRANCE, "FR AA 363478400"));
    System.out.println("GERMANY Verify: " +
                       verifier.verify(GERMANY, "DE 213709651"));
    System.out.println("GERMANY Verify: " +
                       verifier.verify(GERMANY, "DE 136665975"));
}

The result of running this test harness is:

UK Verify: true
UK Verify: true
UK Verify: false
FRANCE Verify: true
GERMANY Verify: true
GERMANY Verify: true

Conclusions

Regular expressions are extremely powerful and very useful and available directly from within Java. However, they need to be used with care. This is both because of the potential complexity of the patterns that can be defined and the ease with an error can be introduced. For example, the pattern \\d{2}\\s{d2} will correctly match the string "12 12" but also match the string "12 123" unless it is delimited to strictly match only the former. In addition, due to the need to include additional "\" characters, the regular expressions can become almost unreadable. However, used wisely they offer huge benefits for processing text for the extraction of data, information, knowledge and analysis.

Further Reading

Mastering Regular Expressions, Friedl, Jeffrey E. F., O'Reilly, Cambridge, MA, ISBN 0596002890, 2002.

Programming Perl, Larry Wall, Tom Christiansen & Jon Orwant, O'Reilly, Cambridge, MA, ISBN 0596000278, 2000.

An Online Regular Expression Tutorial can be found here.

The Java Tutorial: Regular Expressions can be found here.

Choosing a cloud hosting partner with confidence

More from The Register

next story
Netscape Navigator - the browser that started it all - turns 20
It was 20 years ago today, Marc Andreeesen taught the band to play
Sway: Microsoft's new Office app doesn't have an Undo function
Content aggregation, meet the workplace ... oh
Sign off my IT project or I’ll PHONE your MUM
Honestly, it’s a piece of piss
Return of the Jedi – Apache reclaims web server crown
.london, .hamburg and .公司 - that's .com in Chinese - storm the web server charts
NetWare sales revive in China thanks to that man Snowden
If it ain't Microsoft, it's in fashion behind the Great Firewall
Chrome 38's new HTML tag support makes fatties FIT and SKINNIER
First browser to protect networks' bandwith using official spec
Admins! Never mind POODLE, there're NEW OpenSSL bugs to splat
Four new patches for open-source crypto libraries
prev story

Whitepapers

Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
Win a year’s supply of chocolate
There is no techie angle to this competition so we're not going to pretend there is, but everyone loves chocolate so who cares.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
Intelligent flash storage arrays
Tegile Intelligent Storage Arrays with IntelliFlash helps IT boost storage utilization and effciency while delivering unmatched storage savings and performance.