Feeds

Expressing yourself in Java: Regular Expressions

They can be rather opaque, but also extremely powerful

Intelligent flash storage arrays

Regular expressions are an area of computing that most of us know a little about, have a rough understanding of, but have often avoided using except when absolutely necessary. In the past, they often required the use of tools such as Perl, rather than languages such as Java.

However, for some time now, in fact since Java version 1.4, there has been a regular expression package within the stand Java libraries. That is, the Java package, java.util.regex. This package contains two classes, which together provide much of the pattern matching power of tools such as Perl.

In this column we will look at what a regular expression is, why it is useful and how the regular expression package can be used from within Java. As a concrete example, we will look at a simpler Java class that can be used to verify the format of VAT numbers form a number of European countries.

Regular expressions

A regular expression (often referred to as a regex) is a pattern describing a certain amount of text. This pattern describes the structure of a string of text. It can be used to determine if an arbitrary string matches the defined pattern. For example, let's assume (it's an oversimplification) that a UK postcode follows a pattern that can be described as:

  • Two letters
  • Followed by 1 or 2 numbers
  • Followed by a space
  • Followed by 1 or 2 numbers
  • And finally by two letters

Within our programs, a regular expression is exactly like this, except that it is presented in a rather more concise (and more powerful) form.

Literal patterns

A very simple regular expression might be the pattern 'John'. This regular expression could be used to determine whether other strings contained the pattern 'John'. In Java, creating a Pattern object generated from the regular expression to be searched for would do this. This Pattern object is then used to create a Matcher object that is derived form the Pattern object and the string to be processed.

For example, the following program allows any string to be searched for the pattern "John". Line 12 defines the pattern, line 13 creates the Pattern object and line 14 creates the Matcher object. As can be seen from the imports the Pattern and Matcher classes are defined in the java.util.regex package. The Pattern class acts as a factory for creating pattern objects. In turn the Pattern object acts as a factory for creating Matcher objects. The resulting Matcher object is then used in line 19 to determine whether the string in args[0] contains the pattern 'John' or not.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleVerifier {

    /**
     * @param args
     */
    public static void main(String[] args) {
        String format = "John";                      // line 12
        Pattern pattern = Pattern.compile(format);   // line 13
        Matcher matcher = pattern.matcher(args[0]);  // line 14
        System.out.println(format +
                           " is contained in the String >" +
                           args[0] +
                           "< -> " +
                           matcher.find());          // line 19
    }
}

Regular expression patterns

The previous example, while functionally complete, is not the most exciting regular expression that can be imagined. Let us consider instead a VAT number. Let us start with the structure of a VAT number in France. It is:

1 block of two characters, a space and then 1 block of 9 digits

We could, of course, write a program that could check each character in a String to determine whether it matched this rule. However, with regular expressions we can do this in a far more elegant way. We can define a regular expression pattern that contains Character sets, repetition and Short hand character classes. For example, we want to say that the VAT number starts with a block of two characters. This can be done using the pattern:

[A-Za-z]{2}

This states that any character A to Z (upper case or lower case) can appear twice in this pattern. If this pattern were used to create a matcher object, then the resulting matcher would return the following results:

AB -> true
A -> false

Such patterns are so common that there is a short hand form for the letters A-Za-z which is "\w". This short hand form matches a "word character" (alphanumeric characters plus underscore). There are also equivalents for digits ("\d") and for spaces ("\s"). Thus if we wish to write a pattern for two characters, a space and two numbers we could create the pattern:

\w{2}\s\d{2}

This is exactly what has been done in the following FormatTester program. Note that because "\" has a special meaning within Java strings, it must be escaped by another "\". This does make the pattern less readable, but maintains compatibility with existing regular expression formats.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormatTester {

    public static void main(String[] args) {
        String format = "\\w{2}\\s\\d{2}";
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher("AB 12");
        System.out.println(matcher.find());   // line 12
        matcher = pattern.matcher("ABC1A1");
        System.out.println(matcher.find());   // line 14
    }

}

The output at line 12 returns true, while the output at line 14 returns false.

Going further with Regular Expression Patterns

There is a great deal that you can do with regular expressions. For example, you can indicate that a character is optional (using a ?). You can anchor parts of a pattern to a certain position. You can even take into account what is around a particular character. For example, q(?=u) matches the q in question, but not in qestion (note the missing u). You can also use alternation. Alternation is the regular expression equivalent of "or". For example, you can write "Jack|Jill" for Jack or Jill.

To really exploit the power of regular expressions you should start by looking at some of the regular expressions material listed at the end of this column.

A practical example

To illustrate how regular expressions might be used in a larger program we will look at a simple European VAT number verifier. The aim of this program is to verify if a given string is a valid European VAT number given a pattern for a particular countries VAT code. The Program is presented below.

The first part of the program defines the patterns that will be used to determine whether a particular VAT number confirms to a particular countries VAT number format. Note that the formats presented are illustrative and intended to show various aspects of regular expressions within Java and should not be treated as a full or complete definition of a set of regular expressions for VAT number processing.

The patterns defined display various aspects of the pattern language we have looked at above for regular expressions. For example, the COUNTRY_CODE_EXP defines any two characters, while the GERMANY VAT number pattern defines a sequence of nine digits. In turn the FRANCE pattern is a combination of two characters, a space and nine digits. The UK pattern incorporates these features and includes the alternates notation ("or") using the "|" to indicate that the pattern can be either a sequence of digits and spaces, or a pattern of two letters and three numbers where the two letters must either be GD or HA.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VatNumberVerifier {
    /** If a VAT number starts with a country code we will
     * remove it */
    private static final String COUNTRY_CODE_EXP = "\\w{2}";  
    /** 1 block of 9 digits */
    private static final String GERMANY = "\\d{9}";
    /**
     * 1 block of two characters, a space and then
     * 1 block of 9 digits
     */
    private static final String FRANCE = "\\w{2}\\s\\d{9}";
    /**
     * 999 9999 99 or
     * GD999 or
     * HA999.
     * 
     * GD Identifies Government Departments.
     * HA Identifies Health Authorities.
     */
    private static final String UK = 
       "\\d{3}\\s\\d{4}\\s\\d{2}|(GD|HA)\\d{3}";

Following on from this we define a single instance variable to hold the country pattern and initialise it within the constructor:

    private Pattern countryCodePattern = null;
    
    VatNumberVerifier() {
        countryCodePattern = Pattern.compile(COUNTRY_CODE_EXP);
    }

Next we define the method to strip off the country codes (if present). This method uses the pre-configured countryCodePattern Pattern object to determine if the number presented to it starts with a valid county code. If it does, then the first two characters of the VAT number are removed. The resulting string is then returned. If the vat number does not contain a country code then the string is returned as is.

     private String stripCountryCodes(String vatNumber) {
        String result = vatNumber;
        Matcher matcher = countryCodePattern.matcher(vatNumber);
        if (matcher.find()) {
            result = result.substring(2);
        }
        return result;
    }

Finally, we can define the method that will actually perform the verification of the VAT numbers. This method returns true or false depending upon whether the VAT number matches the pattern passed into it.

     public boolean verify(String format, String vatNumber) {
        vatNumber = stripCountryCodes(vatNumber);
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher(vatNumber);
        return matcher.find();
    }

This method can now be used by a test harness to verify some VAT numbers. For example:

public static void main(String [] args) {
    VatNumberVerifier verifier = new VatNumberVerifier();
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK 728 7030 32"));
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK HA787"));
    System.out.println("UK Verify: " +
                       verifier.verify(UK, "UK AB56GGF"));
    System.out.println("FRANCE Verify: " +
                       verifier.verify(FRANCE, "FR AA 363478400"));
    System.out.println("GERMANY Verify: " +
                       verifier.verify(GERMANY, "DE 213709651"));
    System.out.println("GERMANY Verify: " +
                       verifier.verify(GERMANY, "DE 136665975"));
}

The result of running this test harness is:

UK Verify: true
UK Verify: true
UK Verify: false
FRANCE Verify: true
GERMANY Verify: true
GERMANY Verify: true

Conclusions

Regular expressions are extremely powerful and very useful and available directly from within Java. However, they need to be used with care. This is both because of the potential complexity of the patterns that can be defined and the ease with an error can be introduced. For example, the pattern \\d{2}\\s{d2} will correctly match the string "12 12" but also match the string "12 123" unless it is delimited to strictly match only the former. In addition, due to the need to include additional "\" characters, the regular expressions can become almost unreadable. However, used wisely they offer huge benefits for processing text for the extraction of data, information, knowledge and analysis.

Further Reading

Mastering Regular Expressions, Friedl, Jeffrey E. F., O'Reilly, Cambridge, MA, ISBN 0596002890, 2002.

Programming Perl, Larry Wall, Tom Christiansen & Jon Orwant, O'Reilly, Cambridge, MA, ISBN 0596000278, 2000.

An Online Regular Expression Tutorial can be found here.

The Java Tutorial: Regular Expressions can be found here.

Top 5 reasons to deploy VMware with Tegile

More from The Register

next story
Preview redux: Microsoft ships new Windows 10 build with 7,000 changes
Latest bleeding-edge bits borrow Action Center from Windows Phone
Google opens Inbox – email for people too thick to handle email
Print this article out and give it to someone tech-y if you get stuck
Microsoft promises Windows 10 will mean two-factor auth for all
Sneak peek at security features Redmond's baking into new OS
UNIX greybeards threaten Debian fork over systemd plan
'Veteran Unix Admins' fear desktop emphasis is betraying open source
Entity Framework goes 'code first' as Microsoft pulls visual design tool
Visual Studio database diagramming's out the window
Google+ goes TITSUP. But WHO knew? How long? Anyone ... Hello ...
Wobbly Gmail, Contacts, Calendar on the other hand ...
DEATH by PowerPoint: Microsoft warns of 0-day attack hidden in slides
Might put out patch in update, might chuck it out sooner
Redmond top man Satya Nadella: 'Microsoft LOVES Linux'
Open-source 'love' fairly runneth over at cloud event
prev story

Whitepapers

Choosing cloud Backup services
Demystify how you can address your data protection needs in your small- to medium-sized business and select the best online backup service to meet your needs.
Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Storage capacity and performance optimization at Mizuno USA
Mizuno USA turn to Tegile storage technology to solve both their SAN and backup issues.