Why Using the Greedy .* in Regular Expressions Is Almost Never What You Actually Want
Yesterday, I stumbled upon the StackOverflow question How to Extract Data Between Square Brackets Using Perl in which the asker wants to use regular expressions to parse out tuples of values wrapped in square brackets and separated by a comma:
This is the range of values (a1,b1) and [c1,d1].
In the above example, the expected match would be [c1,d1]
with two capturing groups holding the values c1
and d1
, respectively. One user answering the question suggested the use of .*
in their pattern, which is pretty much never what you want. Here's why.
tl;dr:
- Don't use
.*
unless you know what you're doing. - Use
.*?
instead or avoid the dot altogether.
The Dot: Matching (Almost) Arbitrary Characters #
Outside of a character class in a regular expression, the dot (.
) will match any character except a newline; within a character class, the dot is interpreted as a literal and matches the dot character. Most regular expression implementations let you specify a flag instructing the engine to also match newline characters with the dot. Oftentimes, the flag is abbreviated as s
, and in .NET its name is RegexOptions.Singleline
.
Greedy Matching: Gimme, Gimme, Gimme! #
To specify the number of times a token should be matched by the regex engine, you can choose one of the following quantifiers:
?
— match the token zero times (not at all) or exactly once*
— match the token zero or more times+
— match the token one or more times{m,n}
— match the token betweenm
andn
(both including) times, wherem
andn
are natural numbers andn ≥ m
.
In general, the regex engine will try to match as many input characters as possible once it encounters a quantified token like \d+
or, in our case, .*
. That behavior is called greedy matching because the engine will eagerly attempt to match anything it can.
The opposite of greedy matching is lazy matching, which will instruct the engine to match as few input characters as possible and then proceed to the next token in the regular expression pattern. Lazy quantifiers are denoted by appending a ?
to the quantifier symbol, yielding the following lazy quantifiers:
??
*?
+?
{m,n}?
Take the input abc123
, for example. The pattern [a-z]+\d+
(using greedy quantifiers +
) will match the entire string abc123
, while the pattern [a-z]+?\d+?
(using lazy quantifiers +?
) will only match abc1
. Although [a-z]+?
tries to only match one letter, it'll reluctantly try to match more letters if required for the pattern to successfully match the input as a whole.
Backtracking and Input Matching #
As you've seen, a greedy quantifier will try to match as much as it possibly can and only give back matched characters as needed. Every time the engine greedily consumes one more character (or repeated token in general), it has to remember that it made that choice. It will therefore persist its current state and store it so it can come back to it later in a process we call backtracking. When the regular expression engine backtracks, it performs another match attempt at a different position in the pattern.
Storing this backtracking position doesn't come for free, and neither does the actual backtracking process. Because of that it's desirable to minimize the amount of backtracking we're forcing the engine to do. While this isn't too much of a problem for successful matches in small inputs, this kind of optimization is even more relevant for large input strings.
Let's assume the singleline flag is set (so that the dot will match any character) and consider the following pattern proposed in the StackOverflow thread:
\[(.*),(.*)\]
Note that the opening and closing brackets needed to be escaped because they're special characters in a regular expression. With a preceding backslash, the regex engine treats them as literals rather than character class boundaries.
Here's how the pattern is matched against some input:
- First, it tries to match an opening bracket:
\[
- After that, it tries to match (and save) "any amount of anything":
(.*)
- Now it tries to match the separator, a literal comma:
,
- Again, it tries to match (and save) "any amount of anything":
(.*)
- Finally, it tries to match a closing bracket:
\]
So far, so good — but where's the problem?
Bad Performance and Incorrect Matches #
Once the regex engine encounters the first .*
, it'll match every character until the end of the input because the star quantifier is greedy. However, the token following the "anything" is a comma, which means that the regex engine has to backtrack until its current position is in front of a comma. The same goes for the second .*
and the closing bracket.
The .*
pattern does one thing extremely well, and that's creating a huge amount of backtracking positions that need to be saved by the regex engine. That's why this kind of greedy matching behavior can lead to extremely poor performance when executed. Even worse, eagerly consuming that much input can result in undesired matches, as the following input shows:
Points: [x1,y1] and [x2,y2]
The values matched by the capturing groups of the above pattern are x1,y1] and [x2
and y2
, which is most likely not what you wanted to match. Because there was no restriction, .*
kept consuming input characters until the end and after that only gave up as many characters as needed to get a successful input match.
If you want to play around with this pattern a bit, feel free to use this regex fiddle.
Lazy Quantifiers to the Rescue #
The problems caused by greedy matching can easily be solved by making all quantifiers lazy, which looks like the following:
\[(.*?),(.*?)\]
"Any amount of anything" (.*?
) will then try to match as few characters as possible, attempting to match a comma (or a closing bracket, respectively) after each time.
Another solution — and the one proposed by me in the StackOverflow question — is to not use the dot at all, which minimizes the amount of required backtracking:
\[([^,\]]+),([^,\]]+)\]
After the opening bracket, this pattern tries to match as many characters that aren't ,
or ]
as possible. It then tries to match the comma, does the same thing for the second parameter, and attempts to match a closing bracket. While this pattern is slightly harder to read, it's correct and more performant than its competitor.
If you want to increase performance even further, consider employing atomic grouping, which reduces the amount of backtracking information stored by the regex engine. Be careful, though, as atomic groups likely change the set of input strings your expression will match.
The next time you're about to use .*
, please think about it carefully — chances are it won't match what you'd actually want it to.
Futher reading: