Simulating negative-lookahead in regular expressions

Task: rewrite a regular expression with negative-lookahead to a regular expression without any negative-lookahead, while matching the same strings.

Algorithm in pseudo code:

Write an expression RE with negative lookahead groups
For all negative lookahead groups G in the RE, enumerate all possible matches
   For all possible prefixes P of G (including zero-width, excluding the complete string)
        C is the next character after P in the original string G
        yield P[^C]
        yield P$
   Concat all yielded parts with | as ENUM
   Replace G in the original RE with (ENUM)

The algorithm works by replacing the negative-lookahead with a match of all strings but those the lookahead would discard (the full complementary set). There are certain nuances such as lookahead not actually consuming the string and leaving the engine cursor at before the match. For many use cases this is negligible, but it means the transformation is not correct all the time.

Example and explanation 🔗︎

Match all branches that are not prefixed with deploy/.

^(?!deploy/)\S*$ (dbg)

This regular expression only contains a single negative-lookahead deploy/ (variable G)

All possible prefixes P and next chars C in this G are

P C
(empty) d
d e
de p
dep l
depl o
deplo y
deploy /

yield P[^C] generates:

  • [^d\s] match all strings that don’t begin with d (which means the string cannot start with deploy/)
  • d[^e\s] even if the first character is d, match everything not prefixed with de (again, cannot match deploy/)
  • de[^p\s] even if the string is prefixed with de, match everything not prefixed with dep (ditto)
  • dep[^l\s]
  • depl[^o\s]
  • deplo[^y\s]
  • deploy[^/\s] even if the string is prefixed with deploy, match everything not prefixed with deploy/ (matches deployer, does not match deploy/foo)

The inverse set is almost complete, we only omitted all prefixes of the original string. We must also include those (the string must not contain additional characters after this match as denoted by $). This is why we also yield P$:

  • d$
  • de$
  • dep$
  • depl$
  • deplo$
  • deploy$

All of this concatenates to

[^d\s]|d[^e\s]|de[^p\s]|dep[^l\s]|depl[^o\s]|deplo[^y\s]|deploy[^/\s]|d$|de$|dep$|depl$|deplo$|deploy$`.

Plugging it into the original regex yields the final regular expression (dbg):

^([^d\s]|d[^e\s]|de[^p\s]|dep[^l\s]|depl[^o\s]|deplo[^y\s]|deploy[^/\s]|d$|de$|dep$|depl$|deplo$|deploy$)\S*$

Use case 🔗︎

GitLab CI did not support negative expressions at all until version 11.11. Negative lookahead is not supported by the regex engine that GitLab uses for security reasons. If it did, lookahead would be an obvious and still readable choice for implementing negative expressions.

I still occasionally come over older deployment and even though we urge owners to upgrade to a maintained version, it takes time to plan and execute. Since the CI jobs need to be configured in the mean time, I’ve come up with an algorithm that rewrites negative lookahead to a basic pattern.