How to deal with regex? An extremely concise and definitive AI workflow.

Motivation Some time ago I had to deal with some complex regex patterns to process huge amounts of data via Apache Spark (using the pyspark API). In this process I've found a pretty simple, and yet very effective way to deal with the construction and understanding of regex patterns. I'm not a regex expert, and I don't intend to become one. Deal with regex is not something pleasant to me, neither something would like to be doing. However, sometimes it's necessary and this article is intended to show my workflow, partially powered by AI models. If you're in a similar situation, this article is for you. If you're a seasoned developer who knows regex pretty well and can immediately understand upon reading, this article is not for you. On LLMs and correctness LLMs are purely probabilistic models, having NO GUARANTEE on the pseudo information they can provide. A helpful way to understand this behavior is to think of them as some type of extremely efficient and lossy compression. To some degree is possible to trust that the information they yield is ruffly in the ballpark of the correctness, but you should never assume it actually is! Of course, this is true provided that such model been trained on this kind o information. This shortcoming can be much improved upon when using techniques that explicitly include actual relevant information on the context window of the model alongside your prompt. This can be achieved with RAGs, for instance, but this will not be discussed here. Workflow The process basically consists in two steps: Using IA (LLMs) for getting a some understanding of the patterns. Using some deterministic tool to assert that the patterns you deal with actually do what the IA assumes/intends them to do. Understanding some regex pattern that already exists In this case we use the following example: "[^0-9a-zA-Z_\-]+" Let's assume you have stumbled in this pattern on some code and wants to understand what it does. First you can use some IA model to get an overview explanation. We proceed with local LLMs, using DeepSeek R1 (14B parameters), (excluding thinking) in this article, but you can compare the results with Mistral Nemo (12B parameters) on the this link. Model results in a chat: Prompt: What the regex [^0-9a-zA-Z_\-]+ does? The regular expression [^0-9a-zA-Z_\-] is designed to match any single character that is NOT one of the following: Digits: 0 through 9 Letters: Lowercase: a through z Uppercase: A through Z Commonly used in variable names. Included as a literal character. This regex is useful for identifying characters that are outside the allowed set of digits, letters, underscores, and hyphens. For instance, it would match special characters like @, #, $, etc., but not any digit, letter, underscore, or hyphen. Example Usage: In a string "user@name123", the '@' symbol would be matched by this regex. In "Hello_World-123", none of the characters would be matched because they all fall within the allowed set. Using the results above, its possible to grasp a pretty good comprehension of how the regex pattern works and how it could be used. However, we arrive at one of biggest problem with LLMs: they DO NOT yield deterministic information. As cited in the session On LLMs and correctness, it's not possible to have any trust what so ever in this information. Hence, the need to verify each and every piece of the result. This can be done in several manners, such as testing locally with examples etc, but here we proceed using a static analysis with regex101, a free online tool tailored for this purpose. Putting the pattern in the tool, and using the python standard alongside with option /gm (global and multi line) we see: With all of this, we can confirm that the regex actually does what the model explained it does, and we can proceed to alter it or manage it however necessary. Writing some new regex pattern to perform some function Using the same kind of process, we can use the same IA models to generate regex patterns that perform some kind of function. For instance, let's say we need a pattern that recognizes white-spaces at the end of lines to be removed. Model results in a chat: Prompt: Make a regex pattern that matches any white-spaces at the end of any line in the text. To create a regex pattern that matches any whitespace characters at the end of any line, you can use: Pattern: \s+$ This pattern works as follows: \s+ : Matches one or more whitespace characters (spaces, tabs, newlines, etc.). $ : Asserts the position at the end of a line. Options: You may want to add the m (multiline) option so that the $ anchor matches the end of each line rather than just the end of the string. Example in regex with multiline mode enabled: \s+$ This will match any line that consists entirely of whitespace characters. Now we can verify the actual function of the pattern using regex101: Now

Apr 9, 2025 - 18:09
 0
How to deal with regex? An extremely concise and definitive AI workflow.

Motivation

Some time ago I had to deal with some complex regex patterns to process huge amounts of data via Apache Spark (using the pyspark API). In this process I've found a pretty simple, and yet very effective way to deal with the construction and understanding of regex patterns.

I'm not a regex expert, and I don't intend to become one. Deal with regex is not something pleasant to me, neither something would like to be doing. However, sometimes it's necessary and this article is intended to show my workflow, partially powered by AI models.

If you're in a similar situation, this article is for you. If you're a seasoned developer who knows regex pretty well and can immediately understand upon reading, this article is not for you.

On LLMs and correctness

LLMs are purely probabilistic models, having NO GUARANTEE on the pseudo information they can provide. A helpful way to understand this behavior is to think of them as some type of extremely efficient and lossy compression. To some degree is possible to trust that the information they yield is ruffly in the ballpark of the correctness, but you should never assume it actually is! Of course, this is true provided that such model been trained on this kind o information.

This shortcoming can be much improved upon when using techniques that explicitly include actual relevant information on the context window of the model alongside your prompt. This can be achieved with RAGs, for instance, but this will not be discussed here.

Workflow

The process basically consists in two steps:

  1. Using IA (LLMs) for getting a some understanding of the patterns.
  2. Using some deterministic tool to assert that the patterns you deal with actually do what the IA assumes/intends them to do.

Understanding some regex pattern that already exists

In this case we use the following example: "[^0-9a-zA-Z_\-]+"

Let's assume you have stumbled in this pattern on some code and wants to understand what it does. First you can use some IA model to get an overview explanation.

We proceed with local LLMs, using DeepSeek R1 (14B parameters), (excluding thinking) in this article, but you can compare the results with Mistral Nemo (12B parameters) on the this link.

Model results in a chat:

Prompt: What the regex [^0-9a-zA-Z_\-]+ does?

The regular expression [^0-9a-zA-Z_\-] is designed to match any single character that is NOT one of the following:

  1. Digits: 0 through 9
  2. Letters:
    • Lowercase: a through z
    • Uppercase: A through Z
  3. Commonly used in variable names.
  4. Included as a literal character.

This regex is useful for identifying characters that are outside the allowed set of digits, letters, underscores, and hyphens. For instance, it would match special characters like @, #, $, etc., but not any digit, letter, underscore, or hyphen.

Example Usage:

  • In a string "user@name123", the '@' symbol would be matched by this regex.
  • In "Hello_World-123", none of the characters would be matched because they all fall within the allowed set.

Using the results above, its possible to grasp a pretty good comprehension of how the regex pattern works and how it could be used. However, we arrive at one of biggest problem with LLMs: they DO NOT yield deterministic information.

As cited in the session On LLMs and correctness, it's not possible to have any trust what so ever in this information. Hence, the need to verify each and every piece of the result. This can be done in several manners, such as testing locally with examples etc, but here we proceed using a static analysis with regex101, a free online tool tailored for this purpose.

Putting the pattern in the tool, and using the python standard alongside with option /gm (global and multi line) we see:

Results from regex101 0

With all of this, we can confirm that the regex actually does what the model explained it does, and we can proceed to alter it or manage it however necessary.

Writing some new regex pattern to perform some function

Using the same kind of process, we can use the same IA models to generate regex patterns that perform some kind of function. For instance, let's say we need a pattern that recognizes white-spaces at the end of lines to be removed.

Model results in a chat:

Prompt: Make a regex pattern that matches any white-spaces at the end of any line in the text.

To create a regex pattern that matches any whitespace characters at the end of any line, you can use:

Pattern:
\s+$

This pattern works as follows:

  • \s+ : Matches one or more whitespace characters (spaces, tabs, newlines, etc.).
  • $ : Asserts the position at the end of a line.

Options:

  • You may want to add the m (multiline) option so that the $ anchor matches the end of each line rather than just the end of the string.

Example in regex with multiline mode enabled:

\s+$

This will match any line that consists entirely of whitespace characters.

Now we can verify the actual function of the pattern using regex101:

Results from regex101 1

Now, at first glance, the pattern seems to match everything we need, but we can be sure testing it. In doing so, we'll use the following multi line string:

test with no trailing spaces

test with one trailing space 
test with tow trailing spaces  

   test with leading and trailing spaces     
     test with some leading spaces
   test with one leading space
   test with leading and trailing spaces     

And now we replace the pattern matches with an emoji for easy visualization using regex101:

Replace 0

As can be seen above, the pattern the model suggested to us has an unfortunate side-effect: it removes blank lines from the input string. Since in the present hypothetical case this is not intended to happen, we proceed improving it with the model.

Continuation of the model results in a chat:

Prompt: The pattern is removing blank lines, how can i make it match only the characters before the end of the line?

To modify the regex to only match whitespaces at the end of a line when there are non-whitespace characters before them (ignoring blank lines), you can use:

Pattern:
^\S+\s+$

This pattern matches:

  • ^: Start of the line
  • \S+: One or more non-whitespace characters
  • \s+: One or more whitespace characters at the end of the line
  • $: End of the line

Explanation:

  • This regex ensures that there is at least one non-whitespace character before any trailing whitespace.
  • Blank lines (which only contain whitespace) will not be matched because they lack the \S+ part.

Example:
In a text like:

Hello World    <-- Matched
   (contains only spaces)
Another line    <-- Matched
Another line    <-- Not matched (blank line)

The regex ^\S+\s+$ will match lines 1 and 3 but not the blank line.

Now, checking the functionality and testing the new pattern:

Results from regex101 2

Replace 1

And indeed we can assert that the regex does exactly what is intended to do and that such behavior is guaranteed via tests.

Conclusion

Even as a firm pessimist on the majority of the use cases for IA and LLMs, still is not possible for me to ignore some of the actual good use cases for it. I believe that this post showcases some responsible and productive uses of such technology.

AI is here to stay, not as the solve-all-problems hype that it is today, but with some actual good usage cases. I hope this post has given you, the reader, some insight and maybe inspiration to find more of them.

P.S. Tests are your friends, take care of them, treat them well, and they will do the same for you. And always remember, that the Crowdstrike Incident was in part due to regex errors according to themselfs. Be careful!