Regular expressions are incredibly powerful, yet most developers don't really understand when or how to use them. This tutorial will take you from regular expression noob to master. (Well OK, maybe not master, but you will gain quite a bit of knowledge!)

What is Regular Expression Syntax?

It's basically a small language used to find things in strings. Think of something complex like the HTML for a web page. Suppose you wanted to get only part of the page and return it. Sound simple? Trust me, it's harder than you think once you start looking at real-world examples, but this is an area where regular expressions shine. (More on that later.)

Testing your Expressions

.NET, for some strange reason, ships with no regular expression test client. Since regular expression syntax can be tricky to get right, you should always test your expressions before you put them in your code. As a matter of fact, I prefer to test mine incrementally: I write part of an expression and test that part, and when I'm satisfied it returns what I want, I add more and more until I have the whole expression. Basically it's just like coding anything else. You don't write the whole thing, press run, and expect it to work, do ya?

There are many applications on the web you can download that let you test regular expressions, but I prefer an extremely small test application where little can go wrong. Here is what I do.

  1. Create a new Windows application.
  2. Add a TextBox to your form called "txtExpression".
  3. Add two more TextBoxes to your form and set the Multiline property on both of them to true. Name one "txtData" and the other "txtResults".
  4. Add this line to the top of your code:
    using System.Text.RegularExpressions;
  5. Then add a button and put this code in the button's click handler:
    // Start with an empty results box on every run.
    txtResults.Clear();

    Regex r = new Regex(txtExpression.Text, RegexOptions.Singleline | RegexOptions.IgnoreCase);

    MatchCollection col = r.Matches(txtData.Text);

    // Print every group of every match as "(index)value".
    foreach (Match m in col)
    {
        for (int i = 0; i < m.Groups.Count; i++)
        {
            txtResults.Text += "(" + i + ")" + m.Groups[i].Value + "\r\n";
        }
    }

Congrats, you now have a very quick and dirty regular expression test client! Paste your data in the txtData TextBox, put your regular expression in the txtExpression TextBox, and when you press the button the results will go in the txtResults TextBox. Easy, isn't it?
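If you would rather not build a form at all, the same loop can live in a console app. This is only a minimal sketch of the idea: the class name ConsoleRegexTester and the method name Run are made up for this example, and the "cat" test data is just a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical console version of the test client (names are assumptions).
public static class ConsoleRegexTester
{
    // Same logic as the button handler: one "(index)value" line
    // for every group of every match.
    public static List<string> Run(string expression, string data)
    {
        var results = new List<string>();
        var r = new Regex(expression, RegexOptions.Singleline | RegexOptions.IgnoreCase);

        foreach (Match m in r.Matches(data))
        {
            for (int i = 0; i < m.Groups.Count; i++)
            {
                results.Add("(" + i + ")" + m.Groups[i].Value);
            }
        }

        return results;
    }

    public static void Main()
    {
        // "cat" is found twice because IgnoreCase is switched on.
        foreach (string line in Run("cat", "Cat and cat"))
        {
            Console.WriteLine(line);
        }
    }
}
```

Keeping the matching in a method that returns the lines, instead of writing straight to the console, makes it easy to call the same code from a form or a unit test later.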

Making your first expression

OK, let's get on with it. Suppose you have a string like this:

The date of that transaction is 2006-12-02 MST.

And you want to get the date out of it. We can write an expression such as this:

\d{4}-\d{2}-\d{2}

When you put that in our test client and smack the button, you get the date back with a "(0)" at the front. The test client prints that index because a single regular expression can define multiple things to return. For now just ignore the (0) at the front; I'll go into this more later.

So what is that junk? Let's take a peek. In a regular expression "\d" means digit, so it matches a single digit. The next part, "{4}", tells the previous part how many times to match. So together they say look for 4 digits, which matches 2006 in our example. The next part is "-". This is just a literal and matches a "-" in the target data. You can see the digit part being used again for 2 more digits, and again for another two. So the whole thing says find 4 digits, a "-", 2 digits, a "-" and 2 more digits.

Try the same expression against this data:

The date of that transaction is 2006-12-02 MST.
The date of that transaction is 2006-12-15 MST.

You will notice that two results are returned, one for each date. It's easy to use regular expressions to pull any number of items like that out of a list. Very helpful indeed.
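Once the expression works in the test client, using it in real code is a small step. Here is a rough sketch of pulling the dates out in C#; the DateFinder class and FindDates method are names I made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (names are assumptions) that returns every
// yyyy-mm-dd style date found in a piece of text.
public static class DateFinder
{
    public static List<string> FindDates(string text)
    {
        var dates = new List<string>();

        // 4 digits, a dash, 2 digits, a dash, 2 digits.
        foreach (Match m in Regex.Matches(text, @"\d{4}-\d{2}-\d{2}"))
        {
            dates.Add(m.Value);
        }

        return dates;
    }

    public static void Main()
    {
        string data =
            "The date of that transaction is 2006-12-02 MST.\n" +
            "The date of that transaction is 2006-12-15 MST.";

        foreach (string date in FindDates(data))
        {
            Console.WriteLine(date);
        }
    }
}
```

Note the @ in front of the pattern string. It stops C# from treating the backslashes as escape characters, so you can paste the expression exactly as you tested it.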

Wildcards and other fun stuff

OK, so you've got digits. What else can it do, you wonder? Well, let's look at a more complex example. Take this HTML snippet, for instance:

<h1>Welcome to My Page of Goodness</h1>
Thank you for visiting my page. Pick a topic:<br>
<ul>
<li>Goodness</li>
<li>Extra Goodness</li>
<li>Ultra Goodness</li>
<li>Scooby Goodness and snacks too!</li>
</ul>

Suppose you want to get each of the items in the list. How would you do it?

Well, we can use the <li> tags as markers and get the stuff between them. An expression to do this looks like this:

<li>.*?</li>

The "." matches any single character (with the Singleline option our test client uses, that includes line breaks), and the "*" is a quantifier that means 0 or more of whatever comes before it. (Yeah, I know, I could have used "+" instead to require 1 or more.) Now the "?" is more interesting. Try this without the "?" and then with it and see what you get. You will notice that without it, the expression matches from the first <li> tag all the way to the last </li> tag instead of stopping at the first </li> it comes across. The "?" forces minimal (lazy) matching, so the match ends at the first </li> it finds. Be careful: a "?" without a "*", "+" or another "?" in front of it has a different meaning. It means zero or one of the preceding item.
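To see the difference the "?" makes in numbers, here is a small sketch that counts the matches for both versions of the expression. It uses a shortened two-item list and the same options as the test client; the GreedyVsLazy class and CountMatches method are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class (names are assumptions).
public static class GreedyVsLazy
{
    // A shortened version of the list from the tutorial.
    public const string Html =
        "<ul>\n<li>Goodness</li>\n<li>Extra Goodness</li>\n</ul>";

    // Counts matches using the same options as the test client.
    public static int CountMatches(string expression)
    {
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        return Regex.Matches(Html, expression, options).Count;
    }

    public static void Main()
    {
        // Greedy: one big match from the first <li> to the last </li>.
        Console.WriteLine(CountMatches("<li>.*</li>"));   // prints 1

        // Lazy: one match per list item.
        Console.WriteLine(CountMatches("<li>.*?</li>"));  // prints 2
    }
}
```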

Using the same example we can produce other expressions that do the exact same thing such as this one:

<li>[^<]*</li>

There's some new syntax in there. The "[]" define a character class, which matches any one of the characters listed inside it. A "^" as the first character inside the brackets works as a NOT symbol. So "[^<]*" just says match 0 or more characters that are not "<". The match runs until it hits the next tag and stops.
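Here is a quick sketch of that expression run against the full list. The ListItems class and FindItems method are hypothetical names for this example. I left the options off on purpose, since "[^<]" does not depend on Singleline the way "." does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class (names are assumptions).
public static class ListItems
{
    public const string Html =
        "<h1>Welcome to My Page of Goodness</h1>\n" +
        "Thank you for visiting my page. Pick a topic:<br>\n" +
        "<ul>\n" +
        "<li>Goodness</li>\n" +
        "<li>Extra Goodness</li>\n" +
        "<li>Ultra Goodness</li>\n" +
        "<li>Scooby Goodness and snacks too!</li>\n" +
        "</ul>";

    // Returns every <li>...</li> element whose content has no "<" in it.
    public static List<string> FindItems(string html)
    {
        var items = new List<string>();

        foreach (Match m in Regex.Matches(html, "<li>[^<]*</li>"))
        {
            items.Add(m.Value);
        }

        return items;
    }

    public static void Main()
    {
        foreach (string item in FindItems(Html))
        {
            Console.WriteLine(item);
        }
    }
}
```

One thing to keep in mind: because the class excludes "<", this version will not match a list item that has another tag nested inside it, while the ".*?" version will.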

Capture groups are your friends!

Suppose you have a link (an anchor tag with an href) like the one shown below and you want to get both the url and the title of the link. How would you do it?

<a href="http://goodness.is.me.com">My site of wonderful goodness</a>

Obviously you could use the techniques shown previously and run two separate expressions, but that is very wasteful. Regular expressions have a way to do this exact thing with ease: capture groups. A capture group is just a pair of parentheses "( )", but it allows you to pull more than one piece of data out of a single match. Let's look at an example of how you could get the url and title from the link above.

<a href="(.*?)">(.*?)</a>

First, when you run this you will notice that three elements are returned. That's what the (0), (1), (2) at the beginning represent: the index (the capture group number) of the item being returned. You should get the following output:

(0)<a href="http://goodness.is.me.com">My site of wonderful goodness</a>
(1)http://goodness.is.me.com
(2)My site of wonderful goodness

The syntax of that expression should all be familiar, as we just covered the wildcards earlier. Notice that we now match the whole link and have parentheses around parts of the expression? The parentheses define the capture groups. Each Match object returned by the Regex class has a Groups collection that holds each of the capture groups. The text matched by the whole regular expression goes in element 0. The first capture group goes in 1 and the second capture group goes in 2. It's an easy and very powerful way to get multiple values from a single expression match!
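In code, reading the groups looks roughly like this. It's a sketch only: the LinkParser class and Parse method are names I invented, and returning an empty array when nothing matches is just one way to handle that case.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are assumptions).
public static class LinkParser
{
    // Returns { url, title } for the first link found,
    // or an empty array when the text contains no link.
    public static string[] Parse(string html)
    {
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        Match m = Regex.Match(html, "<a href=\"(.*?)\">(.*?)</a>", options);

        if (!m.Success)
        {
            return new string[0];
        }

        // Groups[0] is the whole match; 1 and 2 are the capture groups.
        return new string[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        string[] parts = Parse(
            "<a href=\"http://goodness.is.me.com\">My site of wonderful goodness</a>");

        Console.WriteLine(parts[0]);  // the url
        Console.WriteLine(parts[1]);  // the title
    }
}
```

Always check Success before reading the groups. On a failed match the group values are just empty strings, which can hide the fact that nothing was found.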

Just a quick note: you can give your capture groups names if you like, and then in the code you can pull them out of the Groups collection by name instead of by index. You can name them like so:

<a href="(?<url>.*?)">(?<title>.*?)</a>

Note that each capture group now has a ?<name> in it. This defines the capture group's name. (Another note: whoever created this language loves the ? character!)
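Pulling the named groups out in code might look something like this. Again, this is just a sketch, and NamedLinkParser, GetUrl and GetTitle are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are assumptions).
public static class NamedLinkParser
{
    // In a verbatim (@) string a double quote is written as "".
    private const string Expression =
        @"<a href=""(?<url>.*?)"">(?<title>.*?)</a>";

    private static Match Find(string html)
    {
        return Regex.Match(html, Expression,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    // Both methods return an empty string when there is no match.
    public static string GetUrl(string html)
    {
        return Find(html).Groups["url"].Value;
    }

    public static string GetTitle(string html)
    {
        return Find(html).Groups["title"].Value;
    }

    public static void Main()
    {
        string link =
            "<a href=\"http://goodness.is.me.com\">My site of wonderful goodness</a>";

        Console.WriteLine(GetUrl(link));
        Console.WriteLine(GetTitle(link));
    }
}
```

The nice thing about names is that the code keeps working if you later add another capture group in front of these two, which would shift the numeric indexes.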

Alternation

Suppose you want to match two different things such as this:

grey
gray

Yes, I know you could wildcard it like gr\wy to allow any word character (a letter, digit or underscore) in that spot, but let's be a bit more specific so that it only matches "e" and "a". This is where alternation comes in. Basically it's just like a logical OR in C#.

gr(?:e|a)y

I know, I know, what is the (?: stuff all about? Well, parentheses normally define a capture group, and the ?: part tells it not to capture the group. So basically the parentheses work more like brackets in math: they just group parts of the expression instead of capturing them.
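A small sketch makes the effect of ?: visible: the expression still matches both spellings, but the Groups collection only holds the whole match. The GreyOrGray class and its methods are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class (names are assumptions).
public static class GreyOrGray
{
    // True when the text contains "grey" or "gray".
    public static bool ContainsGrey(string text)
    {
        return Regex.IsMatch(text, "gr(?:e|a)y");
    }

    // Number of entries in Groups when the expression is matched against "grey".
    public static int GroupCount(string expression)
    {
        return Regex.Match("grey", expression).Groups.Count;
    }

    public static void Main()
    {
        Console.WriteLine(ContainsGrey("grey"));   // True
        Console.WriteLine(ContainsGrey("gray"));   // True
        Console.WriteLine(ContainsGrey("groy"));   // False

        // Non-capturing: only element 0, the whole match.
        Console.WriteLine(GroupCount("gr(?:e|a)y"));  // 1

        // Capturing: element 0 plus the "e" in element 1.
        Console.WriteLine(GroupCount("gr(e|a)y"));    // 2
    }
}
```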

A bit about options

You may have noticed that the Regex class can accept some options in its constructor. We used these when we created our test client:

Regex r = new Regex(txtExpression.Text,RegexOptions.Singleline | RegexOptions.IgnoreCase);

The options are very important, as they make a huge difference to what gets captured and what does not. Let's look at the two we used here.

The Singleline option is described in the help as: "Specifies single-line mode. Changes the meaning of the period character (.) so that it matches every character (instead of every character except \n)."
So in short it means that a "." can match across line breaks. Without it, a ".*" stops at the end of the current line. I'm surprised that this isn't the default, as it has bitten me more than once, so make sure you set this option to get the behavior you want.

Next up is the IgnoreCase option. This one should be self-explanatory: it matches letters regardless of their case.
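To show how much the options matter, here is a sketch that runs one expression against one piece of data with each combination of the two options. The data is chosen so that it needs both: the tags are upper case and the content spans two lines. OptionsDemo and Matches are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class (names are assumptions).
public static class OptionsDemo
{
    // Upper-case tags with a line break in the middle of the content.
    public const string Data = "<B>one\ntwo</B>";

    public static bool Matches(RegexOptions options)
    {
        return Regex.IsMatch(Data, "<b>.*</b>", options);
    }

    public static void Main()
    {
        // Fails on case: <b> does not match <B>.
        Console.WriteLine(Matches(RegexOptions.None));        // False

        // Case is fine now, but "." stops at the line break.
        Console.WriteLine(Matches(RegexOptions.IgnoreCase));  // False

        // "." crosses the line break, but the case is still wrong.
        Console.WriteLine(Matches(RegexOptions.Singleline));  // False

        // Both problems solved.
        Console.WriteLine(Matches(RegexOptions.Singleline | RegexOptions.IgnoreCase));  // True
    }
}
```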

The only other one I find I use often is the Compiled option. This compiles the regular expression to IL when the Regex object is constructed, instead of interpreting it, so matching goes faster. The trade-off is a slower constructor, so it pays off mainly for expressions you create once and reuse many times. Since the compilation happens at run time, it also works for an expression loaded from a configuration file.
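The way I would typically use the Compiled option is to build the Regex once and keep it around, roughly like the sketch below. The CompiledDates class and CountDates method are names I made up, and I have not measured how much faster this is for a pattern this small.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are assumptions).
public static class CompiledDates
{
    // Built once when the class is first used, then reused for every call.
    private static readonly Regex DatePattern =
        new Regex(@"\d{4}-\d{2}-\d{2}", RegexOptions.Compiled);

    public static int CountDates(string text)
    {
        return DatePattern.Matches(text).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountDates("2006-12-02 and 2006-12-15"));  // prints 2
    }
}
```

Keeping the Regex in a static readonly field means the construction cost is paid a single time, no matter how often CountDates is called.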

A word in parting

Feel like a Regex master now? Yeah, me neither, but hey, you're a few steps closer. Hopefully this tutorial has helped you determine when a regular expression can help you get the job done faster.