Simulating negative-lookahead in regular expressions
Task: rewrite a regular expression with negative-lookahead to a regular expression without any negative-lookahead, while matching the same strings.
Algorithm in pseudo code:
Write an expression RE with negative lookahead groups For all negative lookahead groups G in the RE, enumerate all possible matches For all possible prefixes P of G (including zero-width, excluding the complete string) C is the next character after P in the original string G yield P[^C] yield P$ Concat all yielded parts with | as ENUM Replace G in the original RE with (ENUM)
The algorithm works by replacing the negative-lookahead with a match of all strings but those the lookahead would discard (the full complementary set). There are certain nuances such as lookahead not actually consuming the string and leaving the engine cursor at before the match. For many use cases this is negligible, but it means the transformation is not correct all the time.
Example and explanation 🔗︎
Match all branches that are not prefixed with
This regular expression only contains a single negative-lookahead
All possible prefixes
P and next chars
C in this
yield P[^C] generates:
[^d\s]match all strings that don’t begin with
d(which means the string cannot start with
d[^e\s]even if the first character is
d, match everything not prefixed with
de(again, cannot match
de[^p\s]even if the string is prefixed with
de, match everything not prefixed with
deploy[^/\s]even if the string is prefixed with
deploy, match everything not prefixed with
deployer, does not match
The inverse set is almost complete, we only omitted all prefixes of the original string. We must also include those (the string must not contain additional characters after this match as denoted by
$). This is why we also
All of this concatenates to
Plugging it into the original regex yields the final regular expression (dbg):
Use case 🔗︎
GitLab CI did not support negative expressions at all until version 11.11. Negative lookahead is not supported by the regex engine that GitLab uses for security reasons. If it did, lookahead would be an obvious and still readable choice for implementing negative expressions.
I still occasionally come over older deployment and even though we urge owners to upgrade to a maintained version, it takes time to plan and execute. Since the CI jobs need to be configured in the mean time, I’ve come up with an algorithm that rewrites negative lookahead to a basic pattern.