In this tutorial, we’ll show you how to safeguard regular expressions against denial-of-service (DoS) attacks. We’ll review how regular expressions work in general, focusing on regular expressions that are susceptible to denial-of-service attacks, and various ways to safeguard our applications from compromise.
We’ll cover the following in detail:
- What is regular expression denial-of-service (ReDoS)?
- How do regular expressions work?
- What types of regex are susceptible to DOS attacks?
- How to protect regular expressions against ReDoS attacks
To follow along with this tutorial, you should have have basic knowledge of regular expressions.
We’ll be using the Node.js runtime to run some examples later, so it is essential to have Node.js installed. If you don’t have Node.js installed locally, you can head to the official Node.js website and download the LTS version for your operating system.
What is regular expression denial-of-service (ReDoS)?
ReDoS attacks are one of the many flavors of denial-of-service attacks. The main goal of a DoS attack is to make application/server resources inaccessible to end-users.
Here’s how a DoS attack works: A threat actor tries to take advantage of a vulnerability to cripple the system. For example, the attacker might send a massive barrage of requests that overwhelms the server and forces it to respond to all requests in a disproportionate amount of time. This also forces the server to use a ton of resources and could possibly cause the system to crash.
ReDoS attacks follow the same blueprint: the attacker capitalizes on specific vulnerabilities regex engines face when matching regular expressions such that it takes a disproportionate amount of time to execute that regular expression. This essentially crashes the system or stops the system from responding to user requests.
A Snyk report published in 2019 showed that ReDoS attacks are on the rise. ReDoS exploits increased by 143 percent in 2018, with Node.js apps among the most affected. Because Node’s event loop is single-threaded, such attacks aim to block the event loop, which can have devastating effects.
How do regular expressions work?
Before we proceed, let’s quickly review how regular expression matching works underneath the hood; this will help us understand better how and why some regular expressions are susceptible to denial-of-service attacks.
Regular expression pattern matching can be done by building a finite state machine. You can think of this as an abstract machine that takes a set of inputs and a set of operations that can be performed on that input to produce a specified output.
A finite state machine can be in exactly one of a limited number of states at any given time. A transition happens when a finite state machine changes from one state to another. An example of a finite state machine is a coffee dispenser machine that pours out a specific coffee variety based on the user’s option.
As previously stated, regular expression matching can be done by building a finite state machine. Regular expressions can also be easily converted from finite state to nondeterministic, especially for expressions where there are several possible next states for each input received.
In such cases, after the conversion, there are several algorithms the regular expression engine can use to determine the next states, but let’s focus on the most problematic algorithms:
- The engine tries all the possible paths until a match is found or all the routes are tried and failed (this is called backtracking). This is problematic because you have an exponential number of paths n being taken for an input of length n, so worst case, you get the results in exponential time
- The engine tries to convert it again from nondeterministic automation to deterministic automation. This is problematic because, depending on the execution path, the conversion can take exponential time to finish
So a Regex denial-of-service occurs when either of these two algorithms is applied to a particular regular expression. A malicious user can take advantage of this and trigger one of these two conditions, leading to the worst-case runtime complexity of the regular expression engine.
What types of regex are susceptible to DOS attacks?
Let’s look at an example of a regular expression that is susceptible to DoS attacks. First, we need to install a tool called gnomon, a command-line utility that we’ll use to examine how long a command takes to be executed.
Head over to your terminal and run the following command:
npm install -g gnomon
We’ll focus on the first problem because that is where the more severe type of problem occurs.
Let’s say we have a pattern,
/^(\w+\s?)*$/, that takes a group of words with an optional space after each word. The quantifiers
$ match the words at the beginning and end of the line.
Let’s try a group of words with no special characters:
node -p "/^(\w+\s?)*$/.test('Only valid characters')" | gnomon
We see that it matches and it took 0.0058 seconds to execute that regular expression on my terminal.
Let’s try putting together a sentence with a special character at the end of the last word:
node -p "/^(\w+\s?)*$/.test('Invalid characters!')" | gnomon
As expected, it returned
false and took about 0.0061 seconds to execute that regular expression.
Perfect, all works fine. But the problem is that it can take a very long time for the regex engine to execute the regular expression for a much longer sentence with special characters.
Let’s see that in action. Run the following in your terminal:
node -p "/^(\w+\s?)*$/.test('A long sentence with invalid characters that takes soo much time to be matched that it potentially causes our CPU usage to increase drastically!!!')" | gnomon
You shouldn’t expect a result from that command 😅. If we open our task manager, we can see that the particular process uses a tremendously high CPU percent to execute that regular expression. Essentially, we should notice a sharp increase in overall current CPU usage.
So as you can see, an attacker can exploit a seemingly simple regex pattern to cause our system to use more resources than expected, and longer inputs can cause our system to hang or crash.
Let’s take a look a more in-depth look at this why this happens:
- The leading cause of this problem is a feature available in regex engines called backtracking. The engine first goes through the input and tries to match the content contained in parentheses
- Due to the quantifier
+being greedy, it tries to find as many valid words as it can, so it returns
long sentence with invalid characters that takes so``o
much time to be matched that it potentially causes our CPU usage to increase
- The star quantifier
(\w+\s?)*can then be applied, but there are no more valid words in the input, so it doesn’t return anything
- Due to the
$quantifier in our pattern, the regex engine tries to match the end of the input. Still, we have an invalid word,
drastically!!!, so there’s no match
- The engine moves one step back to the previous position and tries to take a different path, hoping to find a match. Hence, the quantifier
+decreases the count of repetitions, backtracks by one word, and tries to match the rest on the input — which, in this case is
A long sentence with invalid characters that takes soo much time to be matched that it potentially causes our CPU usage to
- The engine then continues its search from the following position: the
*quantifier can be applied again and matches the word
increase. Remember, we have the
$quantifier; the engine uses that, but it fails to match
The regex engine will backtrack again, decreasing the number of repetitions, and keep doing so until all possible paths are explored. We expect regular expression matches to take about O(n) time, where n indicates the input string length.
In most cases, this may be true. Still, in some cases — such as the case we just looked at — the regex engine might need to take an exponential number of paths through the input string to find a match.
More great articles from LogRocket:
- Don't miss a moment with The Replay, a curated newsletter from LogRocket
- Learn how LogRocket's Galileo cuts through the noise to proactively resolve issues in your app
- Use React's useEffect to optimize your application's performance
- Switch between multiple versions of Node
- Discover how to animate your React app with AnimXYZ
- Explore Tauri, a new framework for building binaries
- Compare NestJS vs. Express.js
So in the case of an input with a size of about 125, we run into a situation where the engine takes an exponential number of paths, approximately 2^125 different paths, which gives about 4.2535296e+37 different combinations, because there was an invalid word in a particular position. This typically leads to what is known as catastrophic backtracking. Such regular expressions take a tremendous amount of time and resources to execute.
Finally, we will look at various ways we can safeguard our patterns against such problems.
How to protect regular expressions against DoS attacks
There are several ways to ensure that your regular expression patterns are not susceptible to denial-of-service of attacks.
Reduce the number of combinations
One approach is to reduce the number of combinations performed by the Regex engines. There are several ways to do this:
- Avoid using nested quantifiers — e.g.,
- Avoid ORs with overlapping clauses — e.g.,
Depending on the engine, some regular expressions written using nested quantifiers and overlapping clauses can be executed quickly, but there’s no guarantee. It is safer to be careful.
Another approach is to control backtracking. Although backtracking enables us to construct complex and powerful regular expressions, the eventual benefits may be irrelevant, especially when compared to the poor performance in cases such as the ones we examined previously.
Thankfully, we can use certain features to either limit or suppress backtracking and still create powerful regular expressions. Let’s take a look at two: atomic groups and lookahead.
An atomic group uses the
?> syntax to suppress backtracking into the expression. Once a match is found, it does not allow any other parts to be subject to backtracking, even if it means that there’s a possibility of a successful match.
Using the example we saw previously, we would like our quantifier not to backtrack because, for the most part, backtracking can lead to severe issues, as we saw previously. We can take advantage of a feature called lookahead to enforce that.
When using lookahead assertions, we use the syntax
?= — e.g., for a pattern
A(?=B), it simply says, “Look for A, but match it if only it is followed by B.” This is important because we can determine whether the expression can match the characters that come next without backtracking or advancing.
In this case, we would like to match as many words as possible without backtracking. We can rewrite the pattern that matches words from
(?=(\w+))\1. It may seem a bit unintuitive at first glance, but let’s break it down.
In our rewritten pattern,
(?=(\w+))\1, we tell the engine to look for the longest word at the current position. The pattern in the inner parentheses,
(\w+), tells the engine to memorize the contents, and we can use
\1 to reference it later on.
This solves our issue because we can use the lookahead feature to match the word
w+ as a whole and reference it using the pattern
\1. Essentially, we can implement a possessive
+ quantifier that must match the whole word and not some parts.
In our first example, the pattern specified captures the words, but when it comes across an invalid word, the
+ quantifier forces it to backtrack until it succeeds or fails. In our rewritten example, we used lookahead to find a valid word, which is matched as a whole and included in the pattern using
Let’s run this new pattern together with our previous quantifiers and see if we get the same problem:
node -p "/^((?=(\w+))\1\s?)*$/.test('A long sentence with invalid characters but doesnt cause our CPU usage to increase drastically!!!')" | gnomon
Voila!, we can see that the regular expression is executed, and we receive an output instantly; it took about 0.0052 seconds to get a result.
In this tutorial, we learned how to safeguard regular expressions from denial-of-service attacks. We dove deeper to see how regular expression matching works, which enabled us to understand why and how this problem even occurs. We then looked at an example of a regular expression pattern with such a vulnerability and demonstrated ways to block loopholes DoS attackers may exploit.
LogRocket: Full visibility into your web and mobile apps
LogRocket is a frontend application monitoring solution that lets you replay problems as if they happened in your own browser. Instead of guessing why errors happen, or asking users for screenshots and log dumps, LogRocket lets you replay the session to quickly understand what went wrong. It works perfectly with any app, regardless of framework, and has plugins to log additional context from Redux, Vuex, and @ngrx/store.