Working with regular expressions in .NET is centered around the
Regex class. Its most important methods, such as IsMatch, Match, Matches, Replace, and Split,
are defined as both instance and static methods on the
Regex class, allowing you to use them in two ways:
// Instance method
new Regex(@"\d+").IsMatch("12345") // True

// Static method
Regex.IsMatch("12345", @"\d+") // True
Note the order of parameters in the static method: First comes the input, then the pattern. This has bitten me more than once.
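Here's a minimal sketch of the static variants of the most commonly used Regex methods side by side. The sample input and the class name are made up for illustration; note that the input comes first in every call.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexMethodsDemo
{
    public static void Main()
    {
        const string input = "10 apples, 20 pears"; // hypothetical sample text

        // IsMatch: does the pattern occur anywhere in the input?
        Console.WriteLine(Regex.IsMatch(input, @"\d+"));       // True

        // Match: the first occurrence.
        Console.WriteLine(Regex.Match(input, @"\d+").Value);   // 10

        // Matches: every occurrence.
        Console.WriteLine(Regex.Matches(input, @"\d+").Count); // 2

        // Replace: substitute every occurrence.
        Console.WriteLine(Regex.Replace(input, @"\d+", "#"));  // # apples, # pears

        // Split: cut the input at every occurrence of the pattern.
        Console.WriteLine(string.Join("|", Regex.Split(input, @",\s*"))); // 10 apples|20 pears
    }
}
```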
All the methods listed above allow you to pass in a
RegexOptions value which tells the regex engine how to interpret the pattern and perform the matching. On top of that, the
Regex class lets you pass options into its
Regex(String, RegexOptions) constructor.
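The following options are defined in the RegexOptions enumeration: Compiled, CultureInvariant, ECMAScript, ExplicitCapture, IgnoreCase, IgnorePatternWhitespace, Multiline, None, RightToLeft, and Singleline.
Because the enumeration is decorated with
[Flags], you can combine any of the above options using the | operator: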
var options = RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture;
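Since the options are plain bit flags, you can also inspect or strip them again later. The following is a small sketch, with a made-up class name, that reads the flags back through the Options property of a Regex instance:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedOptionsDemo
{
    public static void Main()
    {
        var options = RegexOptions.IgnoreCase
                    | RegexOptions.CultureInvariant
                    | RegexOptions.ExplicitCapture;

        var regex = new Regex("^file://", options);

        // The Options property returns the flags the regex was constructed with.
        Console.WriteLine(regex.Options.HasFlag(RegexOptions.IgnoreCase)); // True
        Console.WriteLine(regex.Options.HasFlag(RegexOptions.Multiline));  // False

        // The same kind of check written with a bitwise AND.
        Console.WriteLine((regex.Options & RegexOptions.ExplicitCapture) != 0); // True

        // A flag can be removed again with AND-NOT.
        var withoutIgnoreCase = options & ~RegexOptions.IgnoreCase;
        Console.WriteLine(withoutIgnoreCase.HasFlag(RegexOptions.IgnoreCase)); // False
    }
}
```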
In this post, I want to highlight a use case for each of the
RegexOptions values. For a concise summary of all options, please refer to the RegexOptions Enumeration MSDN page.
By default, the regex engine of .NET interprets regular expressions. It can also compile a regular expression to MSIL for increased matching performance, which is what the
RegexOptions.Compiled flag specifies:
private static readonly Regex _digitsOnly = new Regex(@"^\d+$", RegexOptions.Compiled);
While a compiled regular expression executes slightly faster, it takes significantly more time to build. We're talking about orders of magnitude here! Compiling a regular expression will therefore only be advantageous if it's used repeatedly, e.g. in a loop or over the application's lifespan.
A good example of when it makes sense to compile a regular expression is its use in a component that's called repeatedly, such as Jeff Atwood's MarkdownSharp: It makes heavy use of regular expressions which are initialized once and stored in a static field to be reused over and over again.
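If you want to see the construction cost for yourself, a rough measurement along these lines could help. This is only a sketch: the helper name is made up, and the absolute numbers depend heavily on the runtime, the pattern, and JIT warm-up, so I'm deliberately not quoting any results.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class CompiledCostDemo
{
    // Hypothetical helper: measures constructing a regex plus its first use.
    public static TimeSpan TimeConstruction(string pattern, RegexOptions options)
    {
        var stopwatch = Stopwatch.StartNew();
        var regex = new Regex(pattern, options);
        regex.IsMatch("12345"); // include any work that is deferred until first use
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        TimeSpan interpreted = TimeConstruction(@"^\d+$", RegexOptions.None);
        TimeSpan compiled = TimeConstruction(@"^\d+$", RegexOptions.Compiled);

        Console.WriteLine("Interpreted: " + interpreted.TotalMilliseconds + " ms");
        Console.WriteLine("Compiled:    " + compiled.TotalMilliseconds + " ms");
    }
}
```

For anything beyond a quick sanity check, a proper benchmarking harness with multiple iterations would give more trustworthy numbers.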
When you specify
RegexOptions.IgnoreCase, the regex engine has to somehow compare uppercase and lowercase characters. By default, it uses the current culture (
Thread.CurrentThread.CurrentCulture) when doing string comparisons. You'll see in a second why this can lead to unexpected results. Take this short code snippet, for example:
Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

string inputFilePath = "FILE://C:/sample_file.txt";
string filePathPattern = "^file://";
We're using the Turkish culture and defining a file path and our regular expression pattern. If we now try to match the
inputFilePath variable against the pattern, the result will be False:
// False
Regex.IsMatch(inputFilePath, filePathPattern, RegexOptions.IgnoreCase)
This is because in the Turkish language, 'i' is not the lowercase equivalent of 'I', which is why the comparison fails despite the case-insensitive comparison specified by
RegexOptions.IgnoreCase. Adding
RegexOptions.CultureInvariant will yield a match:
// True
Regex.IsMatch(inputFilePath, filePathPattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)
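The underlying casing rule is easy to observe with plain string methods. The sketch below assumes that Turkish culture data is available on the machine; in an environment running in invariant globalization mode, creating the culture may fail or behave like the invariant culture.

```csharp
using System;
using System.Globalization;

public static class TurkishCasingDemo
{
    public static void Main()
    {
        var turkish = new CultureInfo("tr-TR");

        // Under Turkish casing rules, 'I' lowercases to the dotless 'ı' (U+0131) ...
        string lowerTurkish = "FILE".ToLower(turkish);

        // ... whereas the invariant culture maps it to the ASCII 'i'.
        string lowerInvariant = "FILE".ToLowerInvariant();

        Console.WriteLine(lowerTurkish == "file");   // False when Turkish culture data is available
        Console.WriteLine(lowerInvariant == "file"); // True
    }
}
```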
Conclusion: If you're matching written text against a pattern that contains written text itself and you have no control over the culture your code is run in, consider the RegexOptions.CultureInvariant option.
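The .NET regex engine uses its own flavor and provides additions that aren't supported in other engines, such as the ECMAScript regex engine. By using the
RegexOptions.ECMAScript flag, you can restrict the engine to ECMAScript-compliant behavior. Some
RegexOptions flags can't be combined with
RegexOptions.ECMAScript because they aren't defined in ECMAScript's regex engine. Those are: RegexOptions.ExplicitCapture, RegexOptions.IgnorePatternWhitespace, RegexOptions.RightToLeft, and RegexOptions.Singleline.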
While the MSDN is a great resource for regular expressions in .NET, there seems to be a discrepancy between the documentation and the actual framework code:
"The RegexOptions.ECMAScript option can be combined only with the RegexOptions.IgnoreCase and
RegexOptions.Multiline options. The use of any other option in a regular expression results in an
ArgumentOutOfRangeException." (ECMAScript Matching Behavior, MSDN)
However, the implementation of the
Regex class says otherwise. A combination with
RegexOptions.CultureInvariant is allowed, too, as is
RegexOptions.Compiled, provided that you're not using Silverlight. You simply gotta love open source, don't you?
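You can check which combinations the constructor accepts with a small probe like the one below. The helper name is made up; the expected results reflect the framework behavior described above. As an aside, the sketch also shows one visible effect of ECMAScript mode: \d is limited to the ASCII digits 0-9 instead of matching any Unicode decimal digit.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaScriptOptionsDemo
{
    // Hypothetical helper: true if the Regex constructor accepts the combination.
    public static bool IsAccepted(RegexOptions options)
    {
        try
        {
            new Regex("a", options);
            return true;
        }
        catch (ArgumentOutOfRangeException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsAccepted(RegexOptions.ECMAScript | RegexOptions.IgnoreCase));       // True
        Console.WriteLine(IsAccepted(RegexOptions.ECMAScript | RegexOptions.Multiline));        // True
        Console.WriteLine(IsAccepted(RegexOptions.ECMAScript | RegexOptions.CultureInvariant)); // True
        Console.WriteLine(IsAccepted(RegexOptions.ECMAScript | RegexOptions.Singleline));       // False
        Console.WriteLine(IsAccepted(RegexOptions.ECMAScript | RegexOptions.RightToLeft));      // False

        // U+0663 is the Arabic-Indic digit three, a Unicode decimal digit.
        const string arabicIndicThree = "\u0663";
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"^\d$"));                          // True
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"^\d$", RegexOptions.ECMAScript)); // False
    }
}
```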
Grouping parts of a regular expression using parentheses,
( and ), tells the regex engine to store the value of the matched subexpression so that it can be accessed later. If you don't ever do anything with the matched value, though, saving it is unnecessary overhead. This is why there's the concept of non-capturing groups which group a subexpression of a regex, but don't store its value for later reference.
Non-capturing groups start with
(?: and end with ):
var matches = Regex.Matches( "Possible colors include darkblue and lightgreen.", "(?:dark|light)(?:blue|red|green)" );
When your pattern contains lots of non-capturing groups, maybe even nested ones, its readability likely gets worse: The pattern gets longer and if you're not paying attention, you might mistake the
? for the optional quantifier.
RegexOptions.ExplicitCapture turns all capturing groups that aren't explicitly named (see Named Matched Subexpressions) into non-capturing groups and thus allows for a cleaner syntax with less noise:
var matches = Regex.Matches( "Possible colors include darkblue and lightgreen.", "(dark|light)(blue|red|green)", RegexOptions.ExplicitCapture );
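To make the effect visible, here's a sketch that compares the number of groups a match exposes with and without the flag. The group name shade is made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitCaptureDemo
{
    public static void Main()
    {
        const string input = "Possible colors include darkblue and lightgreen.";

        // Default: both parenthesized groups capture (group 0 is the whole match).
        Match plain = Regex.Match(input, "(dark|light)(blue|red|green)");
        Console.WriteLine(plain.Groups.Count);    // 3
        Console.WriteLine(plain.Groups[1].Value); // dark

        // ExplicitCapture: unnamed groups no longer capture anything.
        Match explicitOnly = Regex.Match(
            input, "(dark|light)(blue|red|green)", RegexOptions.ExplicitCapture);
        Console.WriteLine(explicitOnly.Value);        // darkblue
        Console.WriteLine(explicitOnly.Groups.Count); // 1

        // Named groups are still captured.
        Match named = Regex.Match(
            input, "(?<shade>dark|light)(blue|red|green)", RegexOptions.ExplicitCapture);
        Console.WriteLine(named.Groups.Count);          // 2
        Console.WriteLine(named.Groups["shade"].Value); // dark
    }
}
```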
By default, regular expressions are matched against strings case-sensitively:
Regex.IsMatch("abc", "abc") // True
Regex.IsMatch("ABC", "abc") // False
If you specify
RegexOptions.IgnoreCase, both input strings (abc and
ABC) will be matched by the pattern abc:
Regex.IsMatch("abc", "abc", RegexOptions.IgnoreCase) // True
Regex.IsMatch("ABC", "abc", RegexOptions.IgnoreCase) // True
It's especially handy to use the
RegexOptions.IgnoreCase flag when using character classes:
[a-zA-Z] can then be shortened to
[a-z]. If you need to do a case-insensitive match, specifying this flag helps you write clearer, shorter, and more readable patterns.
Be careful, though, with the behavior of different cultures. If you don't know ahead of time which culture your code will be run under, consider using the
IgnoreCase flag in combination with CultureInvariant.
Whitespace characters in a regular expression pattern are treated as whitespace literals by default: If there's a space in the pattern, the engine will attempt to match a space in the input string. You have significant whitespace, if you will.
The RegexOptions.IgnorePatternWhitespace option allows you to structure your pattern using insignificant whitespace as you like. You can even write your pattern across separate lines, which works perfectly together with C#'s verbatim strings:
const string identifierPattern = @"
    ^                # Identifiers start ...
    [a-zA-Z_]        # ... with a letter or an underscore.
    [a-zA-Z_0-9]*    # Possibly some alphanumeric characters ...
    $                # ... and nothing after those.
";

var identifierRegex = new Regex(identifierPattern, RegexOptions.IgnorePatternWhitespace);
bool validIdentifier = identifierRegex.IsMatch("_emailAddress"); // True
As the above example shows, you can also include comments: Everything after the
# symbol until the end of the line will be treated as a comment. When it comes to improving a pattern's readability,
RegexOptions.IgnorePatternWhitespace will likely make the most notable difference. For a real-world example, take a look at a couple of regex patterns in MarkdownSharp that really benefit from RegexOptions.IgnorePatternWhitespace.
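One thing to watch out for: once whitespace is insignificant, a space you actually want to match has to be written differently. The sketch below puts the space into a character class, where whitespace is still taken literally. The pattern and the class name are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternWhitespaceDemo
{
    // Hypothetical pattern: two numbers separated by exactly one space.
    private const string TwoNumbers = @"
        ^ \d+    # first number
        [ ]      # a literal space; whitespace inside a character class is kept
        \d+ $    # second number
    ";

    public static void Main()
    {
        var regex = new Regex(TwoNumbers, RegexOptions.IgnorePatternWhitespace);
        Console.WriteLine(regex.IsMatch("12 34")); // True
        Console.WriteLine(regex.IsMatch("1234"));  // False

        // Without the character class, the spaces vanish from the pattern,
        // which then effectively reads ^\d+\d+$.
        const string naive = @"^ \d+ \d+ $";
        Console.WriteLine(Regex.IsMatch("12 34", naive, RegexOptions.IgnorePatternWhitespace)); // False
        Console.WriteLine(Regex.IsMatch("1234", naive, RegexOptions.IgnorePatternWhitespace));  // True
    }
}
```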
The RegexOptions.Multiline flag changes the meaning of the special characters ^ and
$. Usually, they match at the beginning (
^) and the end (
$) of the entire string. With
RegexOptions.Multiline applied, they match at the beginning or end of any line of the input string.
Here's how you could use
RegexOptions.Multiline to check whether some multi-line string (e.g. from a text file) contains a line that only consists of digits:
Regex.IsMatch("abc\n123", @"^\d+$") // False
Regex.IsMatch("abc\n123", @"^\d+$", RegexOptions.Multiline) // True
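If the text comes from a file with Windows line endings, there's a catch worth knowing: in multiline mode, $ matches right before \n, so the \r of a \r\n pair is still in the way. A sketch of the problem and one possible workaround, using made-up sample strings:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    public static void Main()
    {
        const string unixText = "abc\n123\nxyz";
        const string windowsText = "abc\r\n123\r\nxyz";

        Console.WriteLine(Regex.IsMatch(unixText, @"^\d+$", RegexOptions.Multiline));    // True

        // The \r left over from \r\n sits between the digits and the line end.
        Console.WriteLine(Regex.IsMatch(windowsText, @"^\d+$", RegexOptions.Multiline)); // False

        // Allowing an optional \r before the line end handles both conventions.
        Console.WriteLine(Regex.IsMatch(windowsText, @"^\d+\r?$", RegexOptions.Multiline)); // True
        Console.WriteLine(Regex.IsMatch(unixText, @"^\d+\r?$", RegexOptions.Multiline));    // True
    }
}
```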
RegexOptions.None is the simplest option: It instructs the regular expression engine to use its default behavior without any modifications applied.
The regular expression engine searches the input string from left to right, or from first to last, if you will. Specifying
RegexOptions.RightToLeft changes that behavior so that strings are searched from right to left, or from last to first.
Note that the
RegexOptions.RightToLeft flag does not change the way the pattern is interpreted: It will still be read from left to right (first to last). The option really only changes the direction of the engine walking over the input string. Therefore, all regex constructs – including lookaheads, lookbehinds, and anchors – function identically.
RegexOptions.RightToLeft might result in increased performance if you're looking for a single match that you expect to find at the very end of the string, in which case you'll probably find it faster this way.
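A quick sketch of what the changed search direction means in practice: the first match found is the last one in the string, and Matches returns its results in reverse order. The input string is made up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    public static void Main()
    {
        const string input = "a1b22c333";

        // Left to right: the first run of digits.
        Console.WriteLine(Regex.Match(input, @"\d+").Value); // 1

        // Right to left: the last run of digits.
        Match last = Regex.Match(input, @"\d+", RegexOptions.RightToLeft);
        Console.WriteLine(last.Value); // 333
        Console.WriteLine(last.Index); // 6

        // All matches, in the order the engine finds them.
        var values = Regex.Matches(input, @"\d+", RegexOptions.RightToLeft)
                          .Cast<Match>()
                          .Select(match => match.Value);
        Console.WriteLine(string.Join(",", values)); // 333,22,1
    }
}
```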
RegexOptions.Singleline changes the meaning of the dot (
.), which by default matches every character except
\n. With the
RegexOptions.Singleline flag set, the dot will match every character.
Sometimes, you'll see people use a pattern like
[\d\D] to mean "any character". Such a pattern is a tautology, that is, it's universally true: every character will either be or not be a digit. It has the same behavior as the dot with RegexOptions.Singleline specified.
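The three variants side by side, as a minimal sketch with a made-up two-line input:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    public static void Main()
    {
        const string input = "a\nb";

        // By default, the dot refuses to match the newline between a and b.
        Console.WriteLine(Regex.IsMatch(input, "a.b"));                          // False

        // With Singleline, the dot matches the newline as well.
        Console.WriteLine(Regex.IsMatch(input, "a.b", RegexOptions.Singleline)); // True

        // The [\d\D] workaround matches any character without the flag.
        Console.WriteLine(Regex.IsMatch(input, @"a[\d\D]b"));                    // True
    }
}
```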
In practice, I often find myself using the combination of the following options:
var options = RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace;
Since most of my work is web-related, compiled regular expressions in static fields generally make sense. The last three flags help me keep my patterns simple and readable.
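To show how these flags play together, here's a sketch of a small helper that matches CSS-style hex colors. The class, the method, and the group name digits are all made up for this example; it's meant as an illustration of the option combination, not as a complete color parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexColorMatcher
{
    private const RegexOptions Options =
        RegexOptions.Compiled
        | RegexOptions.CultureInvariant
        | RegexOptions.ExplicitCapture
        | RegexOptions.IgnoreCase
        | RegexOptions.IgnorePatternWhitespace;

    // Built once and reused, which is what makes RegexOptions.Compiled pay off.
    private static readonly Regex HexColor = new Regex(@"
        ^ \#                  # leading hash, escaped because # starts a comment
        (?<digits>            # named group: survives ExplicitCapture
            [0-9a-f]{3}       # three hex digits ...
            ([0-9a-f]{3})?    # ... optionally three more (not captured)
        )
        $", Options);

    // Returns the hex digits without the hash, or an empty string if there's no match.
    public static string ExtractDigits(string input)
    {
        Match match = HexColor.Match(input);
        return match.Success ? match.Groups["digits"].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractDigits("#FFaa00")); // FFaa00
        Console.WriteLine(ExtractDigits("#abc"));    // abc
        Console.WriteLine(ExtractDigits("#abcd") == string.Empty); // True
    }
}
```

Thanks to IgnoreCase, the character class only needs to list a-f, and thanks to ExplicitCapture, the inner parentheses group without capturing.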