Regular expressions in Java

Introduction

You might remember the article I wrote about lambda expressions in Java last year, in which I took a quick look at what lambda expressions are and how you can use them in Java. That article was quite popular, so I thought it was about time to write something about regular expressions here on nerdhut. This is only meant as an introduction: what regular expressions are and how you can use them in your Java code to detect patterns in a text or to search for something.

What expressions?

Regular expressions! I don’t want to dig too deep into this topic here, because I want to keep the article short, but I want to explain the concept of a regex (short for regular expression; “regexp” is also sometimes used) as well as I can. If you are interested in the concepts behind regular expressions and formal languages, I’ve linked some resources at the end of this article, so you can dig deeper on your own.

A regular expression is a string that describes other strings. This might sound very abstract at first, but you can imagine it this way: a regular expression is a rule. Every string that matches this rule is valid, all others are not, and your application can react to valid and invalid inputs when they are detected. Let’s look at some examples right away:

Let’s say we have this set of strings. Each line is a new string:

.define asdf 1234
this is a regular line of code
....
//this is a comment

A regex that would match every string that starts with a period looks like this:

^\..*$

If you have never seen something like this, it might look scary at first, so let’s look at another example that might be a bit easier to understand:

[a-z]+@[a-z]+\.[a-z]{2,3}

Can you guess what set of strings this regex describes?
Hint: You can test regular expressions in an online regex tester such as Regexpal (see the further resources at the end).

-> Simple E-Mail-addresses! <-

Now let’s take a closer look at this regex and take it apart:

[a-z]+ means that a string has to start with lowercase characters between a and z; the plus afterwards means that there has to be at least one of these characters. This is the name of the person you want to send an E-Mail to.

The @ should be pretty self-explanatory. It means that the string needs to have an @-symbol right after the first characters, to match.

The @ then needs to be followed by at least one lowercase letter from the alphabet again ([a-z]+). This is the domain name of the E-Mail server.

Afterwards there has to be a period in the string, indicated by the \. in the regex. Note that you have to escape the period-symbol with a backslash, otherwise you’d define a regex that allows any character at that position.

The last part of the regex, [a-z]{2,3}, means that a correct string has to end with either two or three lowercase characters, which form the TLD of our address.

So these would be correct addresses according to the regex from above:

office@nerdhut.de
a@b.com
let@me.in

And the following ones would be incorrect strings:

peter123@mail.free
1@1.1

So as you can see, a regex is nothing more than a rule that describes what a string has to look like.
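
To make this concrete, here is a minimal sketch that runs the example addresses from above against the regex in Java. The class name SimpleMailCheck and the helper isSimpleMail are made up for this illustration; Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Pattern;

class SimpleMailCheck
{
    // The simple E-Mail pattern from above; the backslash is doubled
    // because this is a Java string literal
    static final Pattern MAIL = Pattern.compile("[a-z]+@[a-z]+\\.[a-z]{2,3}");

    // Hypothetical helper: true only if the whole input fits the pattern
    static boolean isSimpleMail(String input)
    {
        return MAIL.matcher(input).matches();
    }

    public static void main(String[] args)
    {
        String[] inputs = { "office@nerdhut.de", "a@b.com", "let@me.in",
                            "peter123@mail.free", "1@1.1" };

        // Prints true for the first three addresses and false for the last two
        for (String input : inputs)
            System.out.println(input + " -> " + isSimpleMail(input));
    }
}
```

Keep in mind that this pattern is deliberately simple: real-world addresses with digits, dots or uppercase letters in the name would be rejected by it.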

Regex rules

In the section above, I gave you some example regular expressions. But what other possibilities are there to describe what a string has to look like? I want to explain the most common rules here, in the syntax that Java’s regex engine understands (shorthand classes such as \d come from Perl-style regex and are not part of classic POSIX regex). Note that this is not a complete table! These are only some of the rules (the ones that I use most of the time).

First, let’s take a look at the following table, which describes the possibilities to detect a single character:

symbol      meaning
.           any character (by default except line breaks)
\w          any word character (letter, digit or underscore)
\d          any digit
\s          any whitespace character
\W \D \S    NOT \w \d \s
[abcd]      exactly one of these characters, so: a XOR b XOR c XOR d
[^abcd]     NOT [abcd]: exactly one character that is none of these
[a-z]       exactly one character between a and z; shorter form of: [abcdef…xyz]
x           the character x; you can use any character here, for example @ or ß or û. Special characters need to be escaped, like the period, which otherwise means any character.
\.          escaped special character, matches . in the string. Other examples: \? (matches ?) and \\ (matches \)
\t          tab
\n          linefeed
\r          carriage return
\u20DE      unicode character 20DE:   ⃞
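
The following sketch tries a few of these symbols in Java. The class name CharacterClassDemo and the helper fits are assumptions made for this example; Pattern.matches is the standard library call that checks whether a whole string fits a regex. Each symbol stands for exactly one character, which is why "cc" does not fit [abcd].

```java
import java.util.regex.Pattern;

class CharacterClassDemo
{
    // Hypothetical helper: does the whole input fit the regex?
    static boolean fits(String regex, String input)
    {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args)
    {
        System.out.println(fits("\\d", "7"));       // true: a digit
        System.out.println(fits("\\d", "x"));       // false: not a digit
        System.out.println(fits("\\w", "_"));       // true: underscore is a word character
        System.out.println(fits("\\W", "!"));       // true: not a word character
        System.out.println(fits("\\s", "\t"));      // true: a tab is whitespace
        System.out.println(fits("[abcd]", "c"));    // true
        System.out.println(fits("[abcd]", "cc"));   // false: the class stands for one character
        System.out.println(fits("[^abcd]", "x"));   // true
        System.out.println(fits(".", "a"));         // true: unescaped period matches any character
        System.out.println(fits("\\.", "a"));       // false: escaped period matches only "."
    }
}
```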

You can combine the symbols from the table above with the following quantifiers:

symbol   meaning
*        0 or more
+        1 or more
?        0 or exactly 1
{x}      exactly x (x can be any non-negative integer)
{x,}     x or more
{x,y}    at least x but not more than y (x to y, both inclusive)
?        placed after another quantifier (for example *? or +?): match as few as possible

A quantifier is written directly after the character (or group) that it should be applied to. So for example, if you want to match a string that consists of 5 to 10 a’s, your regex would look like this:

a{5,10}
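
Here is a short sketch of how the quantifiers behave in Java, including the difference between a normal (greedy) quantifier and one followed by ?. The class QuantifierDemo and the helpers fits and firstMatch are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo
{
    // Hypothetical helper: does the whole input fit the regex?
    static boolean fits(String regex, String input)
    {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: first part of the input that fits the regex, or null
    static String firstMatch(String regex, String input)
    {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args)
    {
        System.out.println(fits("a{5,10}", "aaaa"));         // false: only 4 a's
        System.out.println(fits("a{5,10}", "aaaaa"));        // true: 5 a's
        System.out.println(fits("a{5,10}", "aaaaaaaaaa"));   // true: 10 a's
        System.out.println(fits("a{5,10}", "aaaaaaaaaaa"));  // false: 11 a's
        System.out.println(fits("ab?c", "ac"));              // true: b is optional
        System.out.println(fits("ab*c", "abbbc"));           // true: any number of b's
        System.out.println(fits("ab+c", "ac"));              // false: at least one b needed

        // Greedy: .+ takes as much as possible
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));   // <b>bold</b>
        // With ? after the quantifier: .+? takes as little as possible
        System.out.println(firstMatch("<.+?>", "<b>bold</b>"));  // <b>
    }
}
```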

And then there are two other rules that I’d like to discuss:

symbol   meaning
|        OR (alternation). Example: abc|xyz matches either abc or xyz (exactly like this)
^x$      start/end of a string. x can be an entire regex. When you work with multiline strings, Java’s Pattern.MULTILINE flag makes ^ and $ match at the start and end of every line instead.
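
As a rough sketch of both rules in Java: the first method below applies the regex ^\..*$ from the beginning of the article to one multiline string, and the main method also tries the alternation abc|xyz. The class AnchorDemo and the method linesStartingWithPeriod are made-up names; Pattern.MULTILINE and Matcher.find are part of java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo
{
    // Hypothetical helper: collects every line of the text that starts with a period
    static List<String> linesStartingWithPeriod(String text)
    {
        // MULTILINE makes ^ and $ match at the start and end of each line
        Pattern p = Pattern.compile("^\\..*$", Pattern.MULTILINE);
        Matcher m = p.matcher(text);

        List<String> result = new ArrayList<>();
        while (m.find())
            result.add(m.group());
        return result;
    }

    public static void main(String[] args)
    {
        String text = ".define asdf 1234\n"
                    + "this is a regular line of code\n"
                    + "....\n"
                    + "//this is a comment";

        System.out.println(linesStartingWithPeriod(text));     // [.define asdf 1234, ....]

        System.out.println(Pattern.matches("abc|xyz", "abc"));     // true
        System.out.println(Pattern.matches("abc|xyz", "xyz"));     // true
        System.out.println(Pattern.matches("abc|xyz", "abcxyz"));  // false: neither alternative fits the whole string
    }
}
```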

You can use parentheses to group specific parts of a regex. For example, if you wanted to extract the name, the server and the TLD from the E-Mail example above, you could put parentheses around the part for the name, the part for the server and the part for the TLD, so that each of them forms its own group. This way you can easily extract these values later on (see the Java example below).

Quick exercise

Use the regexpal-webapp to write a very simple regular expression that matches specific URLs (with protocol!). Test-data:

https://www.nerdhut.de
ftp://dataserver.com
noname.server
ftp:nerdhut.de
smb://www.myweb12.com
smb://shop.rubberduck.com

The first two and the last one should match. Try to find the pattern and write a regex that describes it!
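
If you want to check your own solution afterwards, here is one possible pattern as a Java sketch. It rests on an assumption of mine: that the fifth line is supposed to fail because its host name contains digits. The class name UrlExercise and the helper isSimpleUrl are invented for this example, and many other patterns would solve the exercise just as well.

```java
import java.util.regex.Pattern;

class UrlExercise
{
    // One possible solution (an assumption, not the only answer):
    // lowercase protocol, then "://", then at least two lowercase
    // name parts separated by periods
    static final Pattern URL = Pattern.compile("[a-z]+://[a-z]+(\\.[a-z]+)+");

    // Hypothetical helper: true only if the whole input fits the pattern
    static boolean isSimpleUrl(String input)
    {
        return URL.matcher(input).matches();
    }

    public static void main(String[] args)
    {
        String[] testData = { "https://www.nerdhut.de", "ftp://dataserver.com",
                              "noname.server", "ftp:nerdhut.de",
                              "smb://www.myweb12.com", "smb://shop.rubberduck.com" };

        // Prints true for the first two lines and the last one, false otherwise
        for (String line : testData)
            System.out.println(line + " -> " + isSimpleUrl(line));
    }
}
```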

Regex and Java

Now I quickly want to discuss how you can use regular expressions in Java to detect patterns in strings. I don’t want to go into too much detail here; I’ve linked some resources at the end of this article if you want to learn more about the Java regex API.

For now let’s imagine the following scenario: You’d like to develop a fancy new text editor that highlights each line that starts with a period. How could you do this? For a task that simple you could just use the following method (Pseudocode):

for each line in lines
{
    if(line.startsWith("."))
        line.highlight();
}

From above we know, that the following regex will do the same job:

\..*

But how can we get this into our Java application? Here’s how:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FancyEditor
{
    public static void main(String[] args)
    {
        // These are the lines we want to check
        String[] lines = new String[4];
        lines[0] = ".define asdf 1234";
        lines[1] = "this is a regular line of code;";
        lines[2] = "....";
        lines[3] = "//this is a comment";

        // This is the regular expression
        // Note: ^ and $ are not needed here, because matches()
        // only succeeds if the whole string fits the pattern
        // And don't forget to escape backslashes in Java string literals
        String regex = "\\..*";

        // Create a Pattern object
        Pattern p = Pattern.compile(regex);

        for (String line : lines)
        {
            Matcher m = p.matcher(line);

            if(m.matches())
                System.out.println("Highlight this line: " + line);
        }
    }
}

And when you run this example, you can see that the correct lines (the first and the third one) are written to the console and the code works.
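
One detail of the Matcher class is worth a short sketch: matches() only succeeds if the whole string fits the pattern, while find() looks for the pattern anywhere inside the string. The class MatchesVsFind and its two helper methods are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;

class MatchesVsFind
{
    // Hypothetical helper: true if the whole input fits the regex
    static boolean wholeStringMatches(String regex, String input)
    {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Hypothetical helper: true if any part of the input fits the regex
    static boolean containsMatch(String regex, String input)
    {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args)
    {
        String regex = "\\..*";

        System.out.println(wholeStringMatches(regex, ".define asdf 1234"));  // true
        System.out.println(wholeStringMatches(regex, "file.txt"));           // false: does not start with a period
        System.out.println(containsMatch(regex, "file.txt"));                // true: finds ".txt" in the middle

        // With find() you need ^ to pin the match to the start of the string
        System.out.println(containsMatch("^\\..*", "file.txt"));             // false
    }
}
```

So if you only want whole lines that start with a period, matches() is the simpler choice; with find() you would have to add the ^ yourself.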

E-Mail example with Java

Now let’s extend the E-Mail-example from above: Let’s write a short program that outputs the person’s name, the server and the TLD of an E-Mail address that a user types in, and outputs an error message if a malformed address is supplied. For this we’ll use the groups mentioned earlier. I want to add three groups: one for the name, one for the server and one for the TLD:

([a-z]+)@([a-z]+)\.([a-z]{2,3})

As you can see, I added parentheses around the characters that I want to form a group. If we test this regex with the data from above, we can see which characters get grouped together:

Figure 1: Regex tester showing the groups of a test-string

You can also see that the name is in group 1, the server in group 2 and the TLD in group 3. Now we can use this information to modify our first code-example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MailAddressChecker
{
    public static void main(String[] args)
    {
        // This is the address we want to check
        String input = "office@nerdhut.de";

        // This is the regular expression
        // Note: ^ and $ are not needed here, because matches()
        // only succeeds if the whole string fits the pattern
        // And don't forget to escape backslashes in Java string literals
        String regex = "([a-z]+)@([a-z]+)\\.([a-z]{2,3})";

        // Create a Pattern object
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(input);

        // Input was a valid address
        if(m.matches())
        {
            System.out.println("----------------");
            System.out.println(input);
            System.out.println("----------------");
            
            // group 0 always contains the whole match
            System.out.println("Name: " + m.group(1));
            System.out.println("Server: " + m.group(2));
            System.out.println("TLD: " + m.group(3));
        }
        else
            System.out.println("No valid E-Mail address supplied!");
    }
}

Conclusion

So a regex is basically nothing more than a (set of) rule(s) that describe what a string has to look like. You can use regular expressions to perform a lot of tasks, for example:

pattern detection/pattern matching
searching/inserting/replacing
error detection/correction (input errors, e.g. in E-Mail addresses)
text highlighting/auto indentation
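
As a small sketch of the searching and replacing use-case, the following code reuses the simple E-Mail regex from this article to find all addresses inside a longer text and to replace them with a placeholder. The class SearchAndReplace, its two methods and the placeholder text are assumptions for this example; Matcher.find and Matcher.replaceAll are standard java.util.regex calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchAndReplace
{
    // The simple E-Mail pattern used throughout this article
    static final Pattern MAIL = Pattern.compile("[a-z]+@[a-z]+\\.[a-z]{2,3}");

    // Hypothetical helper: collects every address found in the text
    static List<String> findAddresses(String text)
    {
        List<String> result = new ArrayList<>();
        Matcher m = MAIL.matcher(text);
        while (m.find())
            result.add(m.group());
        return result;
    }

    // Hypothetical helper: replaces every address with a placeholder
    static String hideAddresses(String text)
    {
        return MAIL.matcher(text).replaceAll("<hidden>");
    }

    public static void main(String[] args)
    {
        String text = "write to office@nerdhut.de or to a@b.com today";

        System.out.println(findAddresses(text));  // [office@nerdhut.de, a@b.com]
        System.out.println(hideAddresses(text));  // write to <hidden> or to <hidden> today
    }
}
```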

Regular expressions are quick, but it can take you quite some time to spot a pattern yourself and to write a regular expression that covers most of the cases. So you should always evaluate whether regular expressions are the right thing for your use-case.

Further resources

Formal languages – Wikipedia
Regular expressions – Wikipedia
Regular expressions course – Oracle
Regular expressions in Java – Tutorialspoint
Regex tester – Regexpal
