Regular Expressions in .NET
By Darren Neimke
The purpose of this article is to build upon the existing pool of regular expression articles by
providing an overview of the new regular expression features found in .NET and to offer some guidelines as to
when and how to use them. The reader of this article should be familiar with what regular expressions
are and their base features.
If you are new to regular expressions, check out the Regular Expressions Article Index. A great beginner-level article on RegExs can be found at: An Introduction to Regular Expressions with VBScript. There is also a Regular Expressions FAQ Category over at ASPFAQs.com.
Although I was familiar enough with the basic concepts of regular expressions to use them in VBScript and JScript, I noticed that I was struggling to understand many regular expressions I found in examples and documentation. Some of the new features such as lookaround and named capturing left me feeling more than a little overwhelmed. In addition to this, the documentation for regular expressions was scant and quite often with little or no sample code. Because of this, I initially steered away from using regular expressions in my .NET projects altogether.
In this article I hope to highlight some of these new areas and de-mystify them so that you won't find yourself in the position that I did.
Matching: Groups and Named Captures
From previous regular expression authoring you will likely be familiar with the concept of referencing parenthesized captures via the $1...$N notation - these are referred to as backreferences. To demonstrate this, consider the following VB.NET sample:
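As a minimal sketch of such a sample (the module name, function name and exact pattern are assumptions for illustration, not the original listing), a Regex instance can be created and its Replace method called with $N references in the replacement string:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BackreferenceSample
    ' Reorders "surname, firstname" into "firstname surname".
    Function SwapNames(ByVal input As String) As String
        ' Group 1 captures the surname, group 2 captures the first name.
        Dim re As New Regex("(\w+),\s(\w+)")
        ' $2 and $1 refer to the second and first captured groups.
        Return re.Replace(input, "$2 $1")
    End Function

    Sub Main()
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```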
The above pattern matches two words separated by a comma and a space, captures the surname and the firstname of a user and formats them in firstname, surname order. The result is that the value "Darren Neimke" would be displayed in the browser.
In the Replace statement the $N notation refers to the Nth group of parentheses (captures). An important point to note is that in .NET the zeroth element ($0) refers to the entire matched text - "Neimke, Darren" in the case of the above example.
The Regex class now offers some convenient shared (static) members that allow simple statements to be in-lined, thus reducing the need for unnecessarily bulky code structures such as the one shown above. The useful static members are IsMatch, Match, Matches, Replace and Split. Using this syntax, the previous code can be reduced to a single statement.
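A sketch of the in-lined form, assuming the same hypothetical name-swapping pattern as input, could use the shared Regex.Replace(input, pattern, replacement) overload so that no Regex instance has to be created:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StaticReplaceSample
    ' Same name swap, written as a single call to the shared Replace method.
    Function SwapNames(ByVal input As String) As String
        Return Regex.Replace(input, "(\w+),\s(\w+)", "$2 $1")
    End Function

    Sub Main()
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```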
The reduced code benefits can be further seen with another example, using IsMatch to ensure that a string contains a Decimal number pattern before executing some code:
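One way this check might look is sketched below. The pattern (digits, a dot, digits) is an assumed and deliberately simple definition of a "Decimal number", and the function name is hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IsMatchSample
    ' True when the input contains something shaped like 123.45 anywhere in it.
    Function ContainsDecimal(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, "[0-9]+\.[0-9]+")
    End Function

    Sub Main()
        If ContainsDecimal("Total: 19.95") Then
            ' Only runs when the guard above succeeds.
            Console.WriteLine("Found a decimal number")
        End If
    End Sub
End Module
```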
Prior to .NET, a regular expression's Match object contained many SubMatches. This has remained the same in .NET although they are now referred to as Groups. Groups are a collection property of a Match object and each captured group can be accessed via its index (remembering that index 0 refers to the entire match), like so:
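A minimal sketch of indexed access, assuming the "surname, firstname" input and pattern used for illustration earlier (the function name is hypothetical):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupsByIndexSample
    ' Returns the text captured by the group at the given index (0 = entire match).
    Function GroupAt(ByVal input As String, ByVal index As Integer) As String
        Dim m As Match = Regex.Match(input, "(\w+),\s(\w+)")
        Return m.Groups(index).Value
    End Function

    Sub Main()
        Console.WriteLine(GroupAt("Neimke, Darren", 2)) ' Darren
    End Sub
End Module
```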
This would display the text "Darren" as it is the captured Group at index 2.
Groups can be assigned names via the new (?<nameOfGroup>...) or (?'nameOfGroup'...) syntax. For consistency with other flavors of regular expressions - such as Perl - I prefer the first syntax and it is the one that is most commonly used. Assigning names to groups helps to make your code more self-describing and can lead to improved maintainability. Here's an example of naming the two captures:
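As a sketch under the same assumptions as before (the group names surname and firstname are chosen here for illustration), named groups can be read by name from the Groups collection and referenced as ${name} in a replacement string:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupsSample
    Const Pattern As String = "(?<surname>\w+),\s(?<firstname>\w+)"

    ' Reads a capture by its name instead of its index.
    Function FirstName(ByVal input As String) As String
        Return Regex.Match(input, Pattern).Groups("firstname").Value
    End Function

    ' ${name} in the replacement string refers to the named group.
    Function SwapNames(ByVal input As String) As String
        Return Regex.Replace(input, Pattern, "${firstname} ${surname}")
    End Function

    Sub Main()
        Console.WriteLine(FirstName("Neimke, Darren")) ' Darren
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```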
While captures provide a lot of power, they can incur quite a performance hit. With regular expressions in VBScript and JScript, capturing occurred whenever you used parentheses in a regular expression pattern. Sometimes, though, you need to use parentheses, but you don't need capturing. For example, if you wanted to match either "Let's go this way" or "Let's go that way" you could use the following regular expression:
Let\'s go th(is|at) way
The parentheses with the pipe indicate an option. The pattern matches either "is" or "at" after the "th". Unfortunately, this regular expression incurs an unneeded performance hit because the captured text (either "is" or "at") is remembered via a backreference.
Fortunately, .NET regular expressions provide the
(?:...) syntax, which allows for grouping
to be done without incurring the performance hit of captured text being "remembered" as a backreference.
Using this syntax, the above regular expression could be changed to:
Let\'s go th(?:is|at) way
That pattern would match either:
- "Let's go this way"
- "Let's go that way"
But the match would contain only one group, Groups(0), which is the entire match. Avoiding unneeded captures can improve performance, especially when complex patterns are applied to even moderately large bodies of text.
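The difference in what is captured can be seen by counting the entries in Match.Groups for each pattern. This sketch only shows the group counts; it does not measure performance, and the helper name is hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NonCapturingSample
    ' Number of entries in Match.Groups, including Groups(0) for the entire match.
    Function GroupCount(ByVal pattern As String, ByVal input As String) As Integer
        Return Regex.Match(input, pattern).Groups.Count
    End Function

    Sub Main()
        ' Capturing parentheses: the entire match plus one captured group.
        Console.WriteLine(GroupCount("Let's go th(is|at) way", "Let's go that way"))   ' 2
        ' Non-capturing parentheses: only the entire match.
        Console.WriteLine(GroupCount("Let's go th(?:is|at) way", "Let's go that way")) ' 1
    End Sub
End Module
```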
Lookaround is a feature that is partially implemented in JScript but not in VBScript. There are two directions of lookaround - lookahead and lookbehind - and two flavors of each direction - positive assertion and negative assertion. The syntax for each is:
(?=...) - Positive lookAHEAD
(?!...) - Negative lookAHEAD
(?<=...) - Positive lookBEHIND
(?<!...) - Negative lookBEHIND
Understanding look(ahead|behind) requires an understanding of the difference between matching text and matching position. To help with this understanding I should state first that lookaround assertions are non-consuming. To see what I mean, let's look at the following simple example.
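A minimal sketch of such an example, assuming the pattern test applied to the text "testing" (the helper name is hypothetical). The position reached after a match is Match.Index + Match.Length; a lookahead version of the same pattern is included only for contrast, since it matches a position and consumes nothing:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ConsumingSample
    ' Position in the input immediately after the first match.
    Function EndPosition(ByVal pattern As String, ByVal input As String) As Integer
        Dim m As Match = Regex.Match(input, pattern)
        Return m.Index + m.Length
    End Function

    Sub Main()
        ' Consuming pattern: the four characters "test" are matched and passed over.
        Console.WriteLine(EndPosition("test", "testing"))     ' 4
        ' Lookahead: asserts that "test" follows, but the match has zero length.
        Console.WriteLine(EndPosition("(?=test)", "testing")) ' 0
    End Sub
End Module
```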
When the above pattern is applied to the text the "context" of the parser sits at a position in the text between the "t" and the "i" in the word testing. This is because the regular expression parser bumps along the string as it gets a match, like so:
- Start - ^testing
- Match "t" - t^esting
- Match "e" - te^sting
- Match "s" - tes^ting
- Match "t" - test^ing
Once the parser has consumed text as part of a match, that text is behind the current position; the rest of the pattern continues from there and cannot match the consumed characters again.
To understand where this causes difficulty, consider this: what if you needed to match the word "test" but only when it was contained in the word "tested" and not any other possible combination such as "tester"? With lookahead you can simply assert that condition like so:
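One way to write that assertion is sketched below. The pattern test(?=ed) is an assumed reconstruction of the idea, and the helper name is hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaheadSample
    ' Returns the matched text, or an empty string when there is no match.
    Function MatchTest(ByVal input As String) As String
        ' "test" is consumed; "ed" is only asserted and is not part of the match.
        Return Regex.Match(input, "test(?=ed)").Value
    End Function

    Sub Main()
        Console.WriteLine(MatchTest("tested"))           ' test
        Console.WriteLine(MatchTest("tester").Length)    ' 0
    End Sub
End Module
```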
This works because, with lookaround, the parser is not bumped along the string. This can be especially useful for finding a position in a document by combining a lookahead assertion with a lookbehind assertion. To demonstrate, suppose we need to match the string "test" when it is contained within the string "protested" but not "detested". To do this you can use a negative lookbehind assertion on "de" and a positive lookahead assertion on "tested", like this:
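A sketch of how the two assertions might be combined follows. The pattern (?<!de)(?=tested)test is an assumption based on the description above, and the helper name is hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaroundSample
    ' Index at which "test" is matched, or -1 when the assertions reject the input.
    Function TestIndex(ByVal input As String) As Integer
        ' (?<!de)    the position must not be preceded by "de"
        ' (?=tested) the position must be followed by "tested"
        ' test       the text that is actually consumed
        Dim m As Match = Regex.Match(input, "(?<!de)(?=tested)test")
        If m.Success Then Return m.Index
        Return -1
    End Function

    Sub Main()
        Console.WriteLine(TestIndex("protested")) ' 3
        Console.WriteLine(TestIndex("detested"))  ' -1
    End Sub
End Module
```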
In other words you are matching a position at which to start matching text. The above pattern would set the parser at the following position in the string "protested":
- Start - pro^tested
- Match "t" - prot^ested
- Match "e" - prote^sted
- Match "s" - protes^ted
- Match "t" - protest^ed
Another good example of using lookaround would be to validate "special" password conditions such as: "Password must be between 8 and 20 characters, must contain at least 2 letter characters and at least 2 digit characters. It can only contain either letter or digit characters."
For such a password constraint, the following expression would probably do quite nicely:
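One possible expression for this rule is sketched below. It is an assumption rather than the author's original pattern: it treats "letter" as A-Z or a-z and "digit" as 0-9, and uses \z so that a trailing newline is not accepted:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PasswordSample
    Function IsValidPassword(ByVal candidate As String) As Boolean
        ' (?=(?:.*[A-Za-z]){2})  lookahead: at least two letters somewhere
        ' (?=(?:.*[0-9]){2})     lookahead: at least two digits somewhere
        ' [A-Za-z0-9]{8,20}      8 to 20 characters, letters and digits only
        Return Regex.IsMatch(candidate, _
            "^(?=(?:.*[A-Za-z]){2})(?=(?:.*[0-9]){2})[A-Za-z0-9]{8,20}\z")
    End Function

    Sub Main()
        Console.WriteLine(IsValidPassword("abcd1234")) ' True
        Console.WriteLine(IsValidPassword("a1234567")) ' False (only one letter)
        Console.WriteLine(IsValidPassword("ab12345!")) ' False (invalid character)
    End Sub
End Module
```

Because both lookaheads are anchored at the start and consume nothing, each condition is checked against the whole string before the final character class does the actual matching.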
Readability and Maintainability
One of my personal favorite new features is the ability to have embedded comments in regular expressions. Most of us will have, at one time or another, come across a regular expression that looks somewhat like this:
If you are lucky you might find a comment that alludes to the purpose of the regular expression, but when the time comes to maintain the expression you are undoubtedly left with a sense of anxiety and, more often than not, a complete re-write is undertaken as opposed to some minor maintenance operation. .NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace compiler option and # comments embedded within each line of the pattern string.
This allows for pseudo-code-like comments to be embedded in each line and has the following effect on readability:
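As an illustration, here is a sketch of a commented pattern. The name-swapping pattern and group names are assumptions reused for illustration. With RegexOptions.IgnorePatternWhitespace, unescaped whitespace in the pattern is ignored and an unescaped # starts a comment that runs to the end of the line:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CommentedPatternSample
    Function SwapNames(ByVal input As String) As String
        ' Each pattern line ends with a newline so that its # comment stops there.
        Dim pattern As String = _
            "(?<surname>\w+)    # capture the surname" & Environment.NewLine & _
            ",\s                # a comma followed by one whitespace character" & Environment.NewLine & _
            "(?<firstname>\w+)  # capture the first name"
        Dim re As New Regex(pattern, RegexOptions.IgnorePatternWhitespace)
        Return re.Replace(input, "${firstname} ${surname}")
    End Function

    Sub Main()
        Console.WriteLine(SwapNames("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```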
Finally, a really useful addition to the .NET Framework is that the Regex.Replace() method allows the use of a delegate as the "replacement" argument. To understand what I'm talking about, consider the following snippet:
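A minimal sketch of a plain string replacement follows. The input text and the pattern are assumptions, chosen only so that every word except "of" is replaced by the letter "a":

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SimpleReplaceSample
    ' Replaces every word except "of" with the letter "a".
    Function ReplaceWords(ByVal input As String) As String
        ' (?!of\b) rejects a word boundary that is followed by the whole word "of".
        Return Regex.Replace(input, "\b(?!of\b)\w+\b", "a")
    End Function

    Sub Main()
        Dim myString As String = ReplaceWords("This is a chunk of text")
        Console.WriteLine(myString) ' a a a of a a
    End Sub
End Module
```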
After the replace operation has occurred, the value of myString will be "a a a of a a" and it's fairly obvious what happened. Every time the regular expression parser found a match within the string it replaced it with the letter "a". That's all nice and easy if all you need to do is a straight replace, but what if you need to implement some sort of business logic in the check, or you need to "touch" the sub-matches in some way and re-build the replaced string?
A good enough example is converting all words within a body of text to proper case (i.e. first letter
capitalized). To do this your first instincts might be to create a pattern like so:
You could then enumerate the matches, convert the first sub-match to its uppercase version, join the
sub-matches and re-append them to a
StringBuilder instance, like so:
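A sketch of that enumerate-and-rebuild approach is shown here, assuming the pattern (\w)(\w*) so that the first letter and the rest of each word are separate sub-matches (the names are hypothetical). Only matched text is appended, so anything between matches is left out of the result:

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module StringBuilderSample
    Function NaiveProperCase(ByVal input As String) As String
        Dim sb As New StringBuilder()
        For Each m As Match In Regex.Matches(input, "(\w)(\w*)")
            ' Group 1 is the first letter, group 2 is the rest of the word.
            sb.Append(m.Groups(1).Value.ToUpperInvariant())
            sb.Append(m.Groups(2).Value)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Spaces are not word characters, so they are not appended either.
        Console.WriteLine(NaiveProperCase("this is")) ' ThisIs
    End Sub
End Module
```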
That would work fine if your string contained only word characters, but, what if it looked like this:
~~~ This %%% is ### a chunk of text.
After the replacement operation you would end up with the following string, in which all non-word characters that didn't participate in the matches have been dropped:
There are ways around it, mostly by building bigger, more complex patterns and doing more string
building inside the match collection iteration.
A more elegant solution is to wire up a MatchEvaluator delegate. You can think of a MatchEvaluator as an event handler that fires each time a match occurs. You provide the MatchEvaluator with a pointer (reference) to a handler function and that function will be called each time a match is encountered. The function must take a Match parameter as its single argument and must return a String to the Replace method that invoked it. This method of replacement gives you the flexibility to do all sorts of operations transparently to the Replace method itself, and because it is all handled within the Replace method call, you are not left having to re-build a string as in the previous example.
A demonstration is in order - let's re-write our previous failed attempt at converting a string to proper case using delegates:
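A minimal sketch of the delegate-based version follows. The handler and function names are hypothetical, and the pattern \w+ is an assumption. Regex.Replace calls the handler once per match and leaves all unmatched text in place:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ProperCaseSample
    ' Handler invoked by Regex.Replace for each match; returns the replacement text.
    Function CapitalizeWord(ByVal m As Match) As String
        Dim word As String = m.Value
        Return word.Substring(0, 1).ToUpperInvariant() & word.Substring(1)
    End Function

    Function ToProperCase(ByVal input As String) As String
        ' AddressOf supplies the reference to the handler function.
        Return Regex.Replace(input, "\w+", New MatchEvaluator(AddressOf CapitalizeWord))
    End Function

    Sub Main()
        Console.WriteLine(ToProperCase("~~~ This %%% is ### a chunk of text."))
        ' ~~~ This %%% Is ### A Chunk Of Text.
    End Sub
End Module
```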
As you can see, the separation is much cleaner and having the replacement logic handled in a separate handler method allows you to implement very complicated operations without affecting readability, maintainability or - and most importantly - data integrity as a result of missing data in a string re-building operation.
Novice programmers often tend to rely heavily on inelegant, unwieldy, or slow solutions that focus heavily on string handling operations; programmers with a higher command of languages are more commonly turning to regular expressions to manage and manipulate chunks of text.
The .NET flavor of regular expressions allows regular expressions to be written in a more efficient and maintainable manner. While learning and mastering regular expressions takes time, the ultimate reward is an increased ability to provide accurate solutions efficiently.
There is a sample ASP.NET Web page that uses many of the advanced features
discussed in this article that you can try out. Specifically, the sample Web page
retrieves the HTML from a remote Web server and then prefixes a URL to all hyperlinks that do not start