Character Groups
RegEx engines predefine some groups of characters, which makes writing regular expressions quicker.
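As a small sketch of what such predefined groups look like in practice, the snippet below uses Python's re module. Python and the sample text are assumptions made for illustration. \d, \w and \s are the groups for digits, word characters and whitespace that most engines share.

```python
import re

text = "id_7 = 42;"  # made-up sample input

digits = re.findall(r"\d", text)   # \d: any digit
words = re.findall(r"\w+", text)   # \w: letters, digits and underscore
spaces = re.findall(r"\s", text)   # \s: any whitespace character

print(digits)  # ['7', '4', '2']
print(words)   # ['id_7', '42']
print(spaces)  # [' ', ' ']
```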
Anchors
^ is used to assert the beginning of a line in multi-line mode, or the beginning of the string in whole-string mode.
$ is used to assert the end of a line in multi-line mode, or the end of the string in whole-string mode.
The behaviour of both anchors depends on the match options.
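To illustrate how the match options change the anchors, here is a minimal sketch using Python's re module. Python and the sample text are assumptions for illustration; in Python, multi-line mode is enabled with the re.MULTILINE flag.

```python
import re

text = "first line\nsecond line"  # made-up sample input

# Whole-string mode (Python's default): ^ only matches at the very start.
whole = re.findall(r"^\w+", text)

# Multi-line mode: ^ also matches right after every newline.
multi = re.findall(r"^\w+", text, flags=re.MULTILINE)

# $ behaves the same way for line ends.
ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(whole)  # ['first']
print(multi)  # ['first', 'second']
print(ends)   # ['line', 'line']
```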
Greedy VS Lazy
Some combinators can match either “lazy” or “greedy”.
Lazy matching is when the engine matches only as many characters as required to get to the next step. This should almost always be used.
Greedy matching is when the engine tries to match as many characters as possible. The problem with this is that it can cause heavy “backtracking”: after consuming too much, the engine has to step back through the input, giving up characters one at a time until the rest of the pattern matches. This can cause big performance issues.
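The difference is easiest to see on a concrete input. The following sketch assumes Python's re module and a made-up sample string. The only change between the two patterns is the extra ? that makes the repeat lazy.

```python
import re

text = "<a><b>"  # made-up sample input

# Greedy: .+ first runs to the end of the string, then backs up to the last '>'.
greedy = re.match(r"<.+>", text).group()

# Lazy: .+? stops as soon as the following '>' can match.
lazy = re.match(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```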
Chain
When two expressions are next to each other, they will be chained together, which means that both will be evaluated in-order.
Example: x\d matches an x and then a digit, for example x9.
Or
Two expressions separated by a | cause the RegEx engine to first try to match the left side, and only if that fails, try the right side instead.
Note that “or” has a long left and right scope, which means that ab|cd will match either ab or cd.
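A quick way to check both the scope and the left-first behaviour is sketched below with Python's re module (an assumption made for illustration). Other backtracking engines behave similarly, but some engine families pick the longest alternative instead.

```python
import re

pattern = re.compile(r"ab|cd")

matches_ab = pattern.fullmatch("ab") is not None
matches_cd = pattern.fullmatch("cd") is not None
# If | only had a short scope (a, then b or c, then d), "abd" would match.
matches_abd = pattern.fullmatch("abd") is not None

# The left side is tried first: both alternatives could match here, "a" wins.
first = re.match(r"a|ab", "ab").group()

print(matches_ab, matches_cd, matches_abd)  # True True False
print(first)                                # a
```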
Or-Not
Tries to match the expression to its left, but won’t fail if it doesn’t succeed. It is written as a ? after the expression.
Note that “or-not” has a short left scope, which means that ab? will always match a, and then try to match b.
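The short left scope can be verified with a few inputs. This is a minimal sketch assuming Python's re module; the test strings are made up.

```python
import re

pattern = re.compile(r"ab?")

only_a = pattern.fullmatch("a") is not None    # b is optional
a_and_b = pattern.fullmatch("ab") is not None  # b is matched when present
# If ? applied to the whole "ab", the empty string would match as well.
empty = pattern.fullmatch("") is not None

print(only_a, a_and_b, empty)  # True True False
```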
Repeated
An expression followed by either a * for greedy repeat, or a *? for lazy repeat.
The greedy form matches as many times as possible, while the lazy form matches as few times as it can. Both can also match the pattern zero times.
Note that this has a short left scope.
Repeated At Least Once
An expression followed by either a + for greedy repeat, or a +? for lazy repeat.
The greedy form matches as many times as possible, while the lazy form matches as few times as it can. Both must match at least one time.
Note that this has a short left scope.
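The two repeat operators and their lazy variants can be compared side by side. The sketch below assumes Python's re module and made-up inputs. Note how the short left scope means only the b is repeated, not the whole ab.

```python
import re

star = re.compile(r"ab*")
plus = re.compile(r"ab+")

star_zero = star.fullmatch("a") is not None     # * accepts zero repetitions
plus_zero = plus.fullmatch("a") is not None     # + needs at least one b
plus_abab = plus.fullmatch("abab") is not None  # only b repeats, not "ab"

greedy_plus = re.match(r"ab+", "abbb").group()  # takes every b
lazy_plus = re.match(r"ab+?", "abbb").group()   # takes the one required b
lazy_star = re.match(r"ab*?", "abbb").group()   # takes no b at all

print(star_zero, plus_zero, plus_abab)    # True False False
print(greedy_plus, lazy_plus, lazy_star)  # abbb ab a
```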
(Non-Capture) Group
Groups multiple expressions together for scoping.
Example: (?:abc) will just match abc.
Capture Group
Similar to Non-Capture Groups except that they capture the matched text. This allows the matched text of the inner expression to be extracted later.
Capture group IDs are enumerated from left to right, starting with 1.
Example: (abc)de will match abcde, and store abc in group 1.
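How the two kinds of groups differ when extracting text can be sketched as follows, assuming Python's re module. In Python, group 0 is the whole match and captured groups are read with group(n).

```python
import re

m = re.fullmatch(r"(abc)de", "abcde")
whole = m.group(0)   # the entire match
first = m.group(1)   # text captured by the first group

# A non-capture group takes no ID, so (de) becomes group 1 here.
m2 = re.fullmatch(r"(?:abc)(de)", "abcde")
captured = m2.groups()

# Groups also widen the scope of a repeat: here the whole "ab" repeats.
repeated = re.fullmatch(r"(?:ab)+", "abab") is not None

print(whole, first)  # abcde abc
print(captured)      # ('de',)
print(repeated)      # True
```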
Character Set
By surrounding multiple characters in square brackets, the engine will match any one of them. Most special characters and expressions aren’t parsed inside the brackets, which means that this can also be used to escape characters.
For example: [abc] will match either a, b or c.
And [ab(?:c)] will match either a, b, (, ?, :, c, or ).
Character groups and escaped characters still work inside character sets.
Character sets can also contain ranges. For example: [0-9a-z] will match any digit or any lowercase letter from a to z.
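The examples above can be tried out with a short sketch, assuming Python's re module; the input strings are made up. The last pattern shows a character group and a literal dot inside a set.

```python
import re

plain = re.findall(r"[abc]", "a-b-c-d")

# Inside a set, ( ? : ) are ordinary characters, not group syntax.
literal = re.findall(r"[ab(?:c)]", "(a?:z)")

# Ranges: digits and lowercase a to z.
ranged = re.findall(r"[0-9a-z]", "Ab3_x")

# \d still works inside a set, and . loses its special meaning.
mixed = re.findall(r"[\d.]", "v1.2")

print(plain)    # ['a', 'b', 'c']
print(literal)  # ['(', 'a', '?', ':', ')']
print(ranged)   # ['b', '3', 'x']
print(mixed)    # ['1', '.', '2']
```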
Conclusion
RegEx is perfect for when you just want to match some patterns, but the syntax can make patterns very hard to read or modify.
In the next article, we will start to dive into implementing RegEx.
Stay tuned!