Expressing yourself in Java: Regular Expressions

They can be rather opaque, but also extremely powerful

Regular expressions are an area of computing that most of us know a little about, have a rough understanding of, but have often avoided using except when absolutely necessary. In the past, they often required the use of tools such as Perl, rather than languages such as Java.

However, a regular expression package has been part of the standard Java libraries since Java version 1.4: the package java.util.regex. Its two main classes, Pattern and Matcher, together provide much of the pattern matching power of tools such as Perl.

In this column we will look at what a regular expression is, why it is useful and how the regular expression package can be used from within Java. As a concrete example, we will look at a simple Java class that can be used to verify the format of VAT numbers from a number of European countries.

Regular expressions

A regular expression (often referred to as a regex) is a pattern that describes the structure of a string of text. It can be used to determine whether an arbitrary string matches that structure. For example, let's assume (it's an oversimplification) that a UK postcode follows a pattern that can be described as:

  • Two letters
  • Followed by 1 or 2 numbers
  • Followed by a space
  • Followed by 1 or 2 numbers
  • And finally by two letters

Within our programs, a regular expression is exactly like this, except that it is presented in a rather more concise (and more powerful) form.
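As a preview of the notation introduced in the sections that follow, the simplified postcode description above could be written as a single expression. This is only a sketch: the class name PostcodeSketch and the sample strings are made up for illustration, and the pattern inherits the oversimplification already noted, so it should not be read as a real postcode validator.

```java
import java.util.regex.Pattern;

class PostcodeSketch {
    // Two letters, 1 or 2 digits, a space, 1 or 2 digits, two letters
    // (the simplified structure described above, not the real postcode rules).
    static final Pattern SIMPLE_POSTCODE =
            Pattern.compile("[A-Za-z]{2}\\d{1,2} \\d{1,2}[A-Za-z]{2}");

    static boolean looksLikePostcode(String text) {
        // matches() requires the whole string to fit the pattern
        return SIMPLE_POSTCODE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePostcode("AB12 34CD")); // true
        System.out.println(looksLikePostcode("AB123 4CD")); // false: three digits in the first block
        System.out.println(looksLikePostcode("A12 34CD"));  // false: only one leading letter
    }
}
```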

Literal patterns

A very simple regular expression might be the pattern 'John'. This regular expression could be used to determine whether other strings contain the pattern 'John'. In Java, this is done by creating a Pattern object from the regular expression to be searched for. This Pattern object is then used to create a Matcher object that is derived from the Pattern object and the string to be processed.

For example, the following program allows any string to be searched for the pattern "John". Line 12 defines the pattern, line 13 creates the Pattern object and line 14 creates the Matcher object. As can be seen from the imports the Pattern and Matcher classes are defined in the java.util.regex package. The Pattern class acts as a factory for creating pattern objects. In turn the Pattern object acts as a factory for creating Matcher objects. The resulting Matcher object is then used in line 19 to determine whether the string in args[0] contains the pattern 'John' or not.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleVerifier {

    /**
     * @param args args[0] is the string to be searched
     */
    public static void main(String[] args) {
        String format = "John";                      // line 12
        Pattern pattern = Pattern.compile(format);   // line 13
        Matcher matcher = pattern.matcher(args[0]);  // line 14
        System.out.println(format +
                           " is contained in the String >" +
                           args[0] +
                           "< -> " +
                           matcher.find());          // line 19
    }
}
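Because a Pattern acts as a factory, the same compiled Pattern can be reused to create matchers for any number of strings, and each Matcher can report where its matches were found. The following is a minimal sketch along those lines; the class name OccurrenceFinder and the sample sentences are invented for illustration and are not part of the program above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OccurrenceFinder {
    // Compile once; a compiled Pattern can be shared and reused
    private static final Pattern JOHN = Pattern.compile("John");

    /** Returns the start index of every occurrence of "John" in text. */
    static List<Integer> positions(String text) {
        List<Integer> result = new ArrayList<>();
        Matcher matcher = JOHN.matcher(text);
        // Each call to find() moves on to the next match, if there is one
        while (matcher.find()) {
            result.add(matcher.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions("John met Johnny and john")); // [0, 9]
        System.out.println(positions("nobody here"));              // []
    }
}
```

Note that the lower-case "john" is not reported: a literal pattern is case sensitive unless told otherwise.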

Regular expression patterns

The previous example, while functionally complete, is not the most exciting regular expression that can be imagined. Let us consider instead a VAT number. Let us start with the structure of a VAT number in France. It is:

1 block of two characters, a space and then 1 block of 9 digits

We could, of course, write a program that checks each character in a String to determine whether it matches this rule. However, with regular expressions we can do this in a far more elegant way. We can define a regular expression pattern that contains character sets, repetition and shorthand character classes. For example, we want to say that the VAT number starts with a block of two characters. This can be done using the pattern:

[A-Za-z]{2}

This states that any character A to Z (upper case or lower case) must appear twice in succession. If this pattern were used to create a matcher object, then the resulting matcher would return the following results:

AB -> true
A -> false

Such patterns are so common that there are shorthand forms for frequently used character sets. For example, "\w" matches a "word character": the letters A-Za-z plus digits and the underscore, so it is slightly broader than the letters alone. There are also equivalents for digits ("\d") and for whitespace ("\s"). Thus if we wish to write a pattern for two characters, a space and two digits we could create the pattern:

\w{2}\s\d{2}

This is exactly what has been done in the following FormatTester program. Note that because "\" has a special meaning within Java strings, it must be escaped by another "\". This does make the pattern less readable, but maintains compatibility with existing regular expression formats.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormatTester {

    public static void main(String[] args) {
        String format = "\\w{2}\\s\\d{2}";
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher("AB 12");
        System.out.println(matcher.find());   // line 12
        matcher = pattern.matcher("ABC1A1");
        System.out.println(matcher.find());   // line 14
    }
}

Line 12 prints true, while line 14 prints false: the second string contains no space, so the pattern cannot be found anywhere in it.
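One point worth checking for yourself is that "\w" accepts more than letters. The sketch below compares it with a letters-only character set on a few inputs; the class name ShorthandCheck, its helper methods and the sample strings are illustrative assumptions rather than part of FormatTester.

```java
import java.util.regex.Pattern;

class ShorthandCheck {
    static final Pattern LETTERS = Pattern.compile("[A-Za-z]{2}");
    static final Pattern WORD_CHARS = Pattern.compile("\\w{2}");

    static boolean containsTwoLetters(String s) {
        return LETTERS.matcher(s).find();
    }

    static boolean containsTwoWordChars(String s) {
        return WORD_CHARS.matcher(s).find();
    }

    public static void main(String[] args) {
        // Two letters satisfy both patterns
        System.out.println(containsTwoLetters("AB") + " " + containsTwoWordChars("AB")); // true true
        // A letter and a digit: only \w{2} is found
        System.out.println(containsTwoLetters("A1") + " " + containsTwoWordChars("A1")); // false true
        // Underscore and digit are word characters too
        System.out.println(containsTwoLetters("_9") + " " + containsTwoWordChars("_9")); // false true
        // A single character satisfies neither
        System.out.println(containsTwoLetters("A") + " " + containsTwoWordChars("A"));   // false false
    }
}
```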

Going further with regular expression patterns

There is a great deal that you can do with regular expressions. For example, you can indicate that a character is optional (using a ?). You can anchor parts of a pattern to a certain position. You can even take into account what is around a particular character. For example, q(?=u) matches the q in question, but not in qestion (note the missing u). You can also use alternation. Alternation is the regular expression equivalent of "or". For example, you can write "Jack|Jill" for Jack or Jill.
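Each of these features can be tried out in a few lines. The sketch below is illustrative only: the class name GoingFurther, the helper method found and the sample strings (other than the q(?=u) and Jack|Jill examples mentioned above) are assumptions made for the demonstration.

```java
import java.util.regex.Pattern;

class GoingFurther {
    /** Returns true if regex can be found anywhere in input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // ? makes the preceding character optional
        System.out.println(found("colou?r", "color"));      // true
        System.out.println(found("colou?r", "colour"));     // true

        // ^ anchors the pattern to the start of the input
        System.out.println(found("^John", "John Smith"));   // true
        System.out.println(found("^John", "Dear John"));    // false

        // (?=u) looks ahead for a u without making it part of the match
        System.out.println(found("q(?=u)", "question"));    // true
        System.out.println(found("q(?=u)", "qestion"));     // false

        // | is alternation: Jack or Jill
        System.out.println(found("Jack|Jill", "Jill ran")); // true
        System.out.println(found("Jack|Jill", "John ran")); // false
    }
}
```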

To really exploit the power of regular expressions you should start by looking at some of the regular expressions material listed at the end of this column.

A practical example

To illustrate how regular expressions might be used in a larger program we will look at a simple European VAT number verifier. The aim of this program is to verify whether a given string is a valid European VAT number, given a pattern for a particular country's VAT code. The program is presented below.

The first part of the program defines the patterns that will be used to determine whether a particular VAT number conforms to a particular country's VAT number format. Note that the formats presented are illustrative and intended to show various aspects of regular expressions within Java; they should not be treated as a full or complete definition of a set of regular expressions for VAT number processing.

The patterns defined display various aspects of the pattern language we have looked at above for regular expressions. For example, COUNTRY_CODE_EXP defines a two-character country code, while the GERMANY VAT number pattern defines a sequence of nine digits. In turn the FRANCE pattern is a combination of two characters, a space and nine digits. The UK pattern incorporates these features and includes the alternation notation ("or"), using the "|" to indicate that the pattern can be either a sequence of digits and spaces, or a pattern of two letters and three digits where the two letters must be either GD or HA.

package uk.co.regdeveloper;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VatNumberVerifier {
    /** If a VAT number starts with a country code (two letters)
     * we will remove it */
    private static final String COUNTRY_CODE_EXP = "[A-Za-z]{2}";
    /** 1 block of 9 digits */
    private static final String GERMANY = "\\d{9}";
    /**
     * 1 block of two characters, a space and then
     * 1 block of 9 digits
     */
    private static final String FRANCE = "\\w{2}\\s\\d{9}";
    /**
     * 999 9999 99 or
     * GD999 or
     * HA999.
     * GD Identifies Government Departments.
     * HA Identifies Health Authorities.
     */
    // One expression consistent with the three formats listed above
    private static final String UK =
        "\\d{3}\\s\\d{4}\\s\\d{2}|(GD|HA)\\d{3}";

Following on from this we define a single instance variable to hold the country code pattern and initialise it within the constructor:

    private Pattern countryCodePattern = null;

    VatNumberVerifier() {
        countryCodePattern = Pattern.compile(COUNTRY_CODE_EXP);
    }

Next we define the method to strip off the country code (if present). This method uses the pre-configured countryCodePattern Pattern object to determine whether the number presented to it starts with a country code. It calls lookingAt() rather than find() so that the pattern is only tested at the start of the string. If a country code is present, the first two characters of the VAT number are removed and the resulting string is returned. If the VAT number does not start with a country code then the string is returned as is.

    private String stripCountryCodes(String vatNumber) {
        String result = vatNumber;
        Matcher matcher = countryCodePattern.matcher(vatNumber);
        // lookingAt() only succeeds if the match begins at the first character
        if (matcher.lookingAt()) {
            result = result.substring(2);
        }
        return result;
    }

Finally, we can define the method that will actually perform the verification of the VAT numbers. This method returns true or false depending upon whether the VAT number matches the pattern passed into it.

    public boolean verify(String format, String vatNumber) {
        vatNumber = stripCountryCodes(vatNumber);
        Pattern pattern = Pattern.compile(format);
        Matcher matcher = pattern.matcher(vatNumber);
        return matcher.find();
    }

This method can now be used by a test harness to verify some VAT numbers. For example:

    public static void main(String[] args) {
        VatNumberVerifier verifier = new VatNumberVerifier();
        System.out.println("UK Verify: " +
                           verifier.verify(UK, "UK 728 7030 32"));
        System.out.println("UK Verify: " +
                           verifier.verify(UK, "UK HA787"));
        System.out.println("UK Verify: " +
                           verifier.verify(UK, "UK AB56GGF"));
        System.out.println("FRANCE Verify: " +
                           verifier.verify(FRANCE, "FR AA 363478400"));
        System.out.println("GERMANY Verify: " +
                           verifier.verify(GERMANY, "DE 213709651"));
        System.out.println("GERMANY Verify: " +
                           verifier.verify(GERMANY, "DE 136665975"));
    }
}

The result of running this test harness is:

UK Verify: true
UK Verify: true
UK Verify: false
FRANCE Verify: true
GERMANY Verify: true
GERMANY Verify: true


Regular expressions are extremely powerful, very useful and available directly from within Java. However, they need to be used with care, both because of the potential complexity of the patterns that can be defined and because of the ease with which an error can be introduced. For example, when used with find(), the pattern \\d{2}\\s\\d{2} will correctly match the string "12 12" but will also match the string "12 123" unless it is delimited to strictly match only the former. In addition, due to the need to include additional "\" characters, regular expressions in Java source can become almost unreadable. However, used wisely they offer huge benefits for processing text and extracting data and information from it.
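The point about delimiting a pattern can be seen by comparing find(), which looks for the pattern anywhere in the input, with matches(), which requires the whole input to fit. The sketch below assumes the intended pattern is two digits, a whitespace character and two digits; the class name StrictVersusLoose and its method names are illustrative.

```java
import java.util.regex.Pattern;

class StrictVersusLoose {
    static final Pattern TWO_AND_TWO = Pattern.compile("\\d{2}\\s\\d{2}");
    // The same pattern delimited with ^ and $ anchors
    static final Pattern ANCHORED = Pattern.compile("^\\d{2}\\s\\d{2}$");

    /** True if the pattern occurs anywhere in s. */
    static boolean loose(String s) {
        return TWO_AND_TWO.matcher(s).find();
    }

    /** True only if the whole of s fits the pattern. */
    static boolean strict(String s) {
        return TWO_AND_TWO.matcher(s).matches();
    }

    /** True if the anchored pattern is found in s. */
    static boolean anchored(String s) {
        return ANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(loose("12 12"));     // true
        System.out.println(loose("12 123"));    // true: "12 12" is found inside the input
        System.out.println(strict("12 12"));    // true
        System.out.println(strict("12 123"));   // false: the trailing 3 is left over
        System.out.println(anchored("12 123")); // false
    }
}
```

Which form to use is a design choice: find() suits searching within larger text, while matches() or explicit anchors suit validating a complete value.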

Further Reading

Mastering Regular Expressions, Friedl, Jeffrey E. F., O'Reilly, Cambridge, MA, ISBN 0596002890, 2002.

Programming Perl, Larry Wall, Tom Christiansen & Jon Orwant, O'Reilly, Cambridge, MA, ISBN 0596000278, 2000.

An Online Regular Expression Tutorial can be found here.

The Java Tutorial: Regular Expressions can be found here.
