In this tutorial, we’ll show you how to safeguard regular expressions against denial-of-service (DoS) attacks. We’ll review how regular expressions work in general, focusing on regular expressions that are susceptible to denial-of-service attacks, and various ways to safeguard our applications from compromise.
We’ll cover the following in detail:
To follow along with this tutorial, you should have have basic knowledge of regular expressions.
We’ll be using the Node.js runtime to run some examples later, so it is essential to have Node.js installed. If you don’t have Node.js installed locally, you can head to the official Node.js website and download the LTS version for your operating system.
ReDoS attacks are one of the many flavors of denial-of-service attacks. The main goal of a DoS attack is to make application/server resources inaccessible to end-users.
Here’s how a DoS attack works: A threat actor tries to take advantage of a vulnerability to cripple the system. For example, the attacker might send a massive barrage of requests that overwhelms the server and forces it to respond to all requests in a disproportionate amount of time. This also forces the server to use a ton of resources and could possibly cause the system to crash.
ReDoS attacks follow the same blueprint: the attacker capitalizes on specific vulnerabilities regex engines face when matching regular expressions such that it takes a disproportionate amount of time to execute that regular expression. This essentially crashes the system or stops the system from responding to user requests.
A Snyk report published in 2019 showed that ReDoS attacks are on the rise. ReDoS exploits increased by 143 percent in 2018, with Node.js apps among the most affected. Because Node’s event loop is single-threaded, such attacks aim to block the event loop, which can have devastating effects.
Before we proceed, let’s quickly review how regular expression matching works underneath the hood; this will help us understand better how and why some regular expressions are susceptible to denial-of-service attacks.
Regular expression pattern matching can be done by building a finite state machine. You can think of this as an abstract machine that takes a set of inputs and a set of operations that can be performed on that input to produce a specified output.
A finite state machine can be in exactly one of a limited number of states at any given time. A transition happens when a finite state machine changes from one state to another. An example of a finite state machine is a coffee dispenser machine that pours out a specific coffee variety based on the user’s option.
As previously stated, regular expression matching can be done by building a finite state machine. Regular expressions can also be easily converted from finite state to nondeterministic, especially for expressions where there are several possible next states for each input received.
In such cases, after the conversion, there are several algorithms the regular expression engine can use to determine the next states, but let’s focus on the most problematic algorithms:
So a Regex denial-of-service occurs when either of these two algorithms is applied to a particular regular expression. A malicious user can take advantage of this and trigger one of these two conditions, leading to the worst-case runtime complexity of the regular expression engine.
Let’s look at an example of a regular expression that is susceptible to DoS attacks. First, we need to install a tool called gnomon, a command-line utility that we’ll use to examine how long a command takes to be executed.
Head over to your terminal and run the following command:
npm install -g gnomon
We’ll focus on the first problem because that is where the more severe type of problem occurs.
Let’s say we have a pattern, /^(\w+\s?)*$/
, that takes a group of words with an optional space after each word. The quantifiers ^
and $
match the words at the beginning and end of the line.
Let’s try a group of words with no special characters:
node -p "/^(\w+\s?)*$/.test('Only valid characters')" | gnomon
We see that it matches and it took 0.0058 seconds to execute that regular expression on my terminal.
Let’s try putting together a sentence with a special character at the end of the last word:
node -p "/^(\w+\s?)*$/.test('Invalid characters!')" | gnomon
As expected, it returned false
and took about 0.0061 seconds to execute that regular expression.
Perfect, all works fine. But the problem is that it can take a very long time for the regex engine to execute the regular expression for a much longer sentence with special characters.
Let’s see that in action. Run the following in your terminal:
node -p "/^(\w+\s?)*$/.test('A long sentence with invalid characters that takes soo much time to be matched that it potentially causes our CPU usage to increase drastically!!!')" | gnomon
You shouldn’t expect a result from that command 😅. If we open our task manager, we can see that the particular process uses a tremendously high CPU percent to execute that regular expression. Essentially, we should notice a sharp increase in overall current CPU usage.
So as you can see, an attacker can exploit a seemingly simple regex pattern to cause our system to use more resources than expected, and longer inputs can cause our system to hang or crash.
Let’s take a look a more in-depth look at this why this happens:
\w+\s?
+
being greedy, it tries to find as many valid words as it can, so it returns A
long sentence with invalid characters that takes so``o
much time to be matched that it potentially causes our CPU usage to increase
(\w+\s?)*
can then be applied, but there are no more valid words in the input, so it doesn’t return anything$
quantifier in our pattern, the regex engine tries to match the end of the input. Still, we have an invalid word, drastically!!!
, so there’s no match+
decreases the count of repetitions, backtracks by one word, and tries to match the rest on the input — which, in this case is A long sentence with invalid characters that takes soo much time to be matched that it potentially causes our CPU usage to
*
quantifier can be applied again and matches the word increase
. Remember, we have the $
quantifier; the engine uses that, but it fails to match drastically!!!
againThe regex engine will backtrack again, decreasing the number of repetitions, and keep doing so until all possible paths are explored. We expect regular expression matches to take about O(n) time, where n indicates the input string length.
In most cases, this may be true. Still, in some cases — such as the case we just looked at — the regex engine might need to take an exponential number of paths through the input string to find a match.
So in the case of an input with a size of about 125, we run into a situation where the engine takes an exponential number of paths, approximately 2^125 different paths, which gives about 4.2535296e+37 different combinations, because there was an invalid word in a particular position. This typically leads to what is known as catastrophic backtracking. Such regular expressions take a tremendous amount of time and resources to execute.
Finally, we will look at various ways we can safeguard our patterns against such problems.
There are several ways to ensure that your regular expression patterns are not susceptible to denial-of-service of attacks.
One approach is to reduce the number of combinations performed by the Regex engines. There are several ways to do this:
(a+)*
(b|b)*
Depending on the engine, some regular expressions written using nested quantifiers and overlapping clauses can be executed quickly, but there’s no guarantee. It is safer to be careful.
Another approach is to control backtracking. Although backtracking enables us to construct complex and powerful regular expressions, the eventual benefits may be irrelevant, especially when compared to the poor performance in cases such as the ones we examined previously.
Thankfully, we can use certain features to either limit or suppress backtracking and still create powerful regular expressions. Let’s take a look at two: atomic groups and lookahead.
An atomic group uses the ?>
syntax to suppress backtracking into the expression. Once a match is found, it does not allow any other parts to be subject to backtracking, even if it means that there’s a possibility of a successful match.
This method of suppressing backtracking helps improve performance when using nested quantifiers. Unfortunately, this feature is not implemented by all regex engines and notably isn’t available in JavaScript/Node.js.
Let’s look at another feature that enables us to do a similar thing and is available in JavaScript/Node.js.
Using the example we saw previously, we would like our quantifier not to backtrack because, for the most part, backtracking can lead to severe issues, as we saw previously. We can take advantage of a feature called lookahead to enforce that.
When using lookahead assertions, we use the syntax ?=
— e.g., for a pattern A(?=B)
, it simply says, “Look for A, but match it if only it is followed by B.” This is important because we can determine whether the expression can match the characters that come next without backtracking or advancing.
In this case, we would like to match as many words as possible without backtracking. We can rewrite the pattern that matches words from \w+
to (?=(\w+))\1
. It may seem a bit unintuitive at first glance, but let’s break it down.
In our rewritten pattern, (?=(\w+))\1
, we tell the engine to look for the longest word at the current position. The pattern in the inner parentheses, (\w+)
, tells the engine to memorize the contents, and we can use \1
to reference it later on.
This solves our issue because we can use the lookahead feature to match the word w+
as a whole and reference it using the pattern \1
. Essentially, we can implement a possessive +
quantifier that must match the whole word and not some parts.
In our first example, the pattern specified captures the words, but when it comes across an invalid word, the +
quantifier forces it to backtrack until it succeeds or fails. In our rewritten example, we used lookahead to find a valid word, which is matched as a whole and included in the pattern using \1
.
Let’s run this new pattern together with our previous quantifiers and see if we get the same problem:
node -p "/^((?=(\w+))\1\s?)*$/.test('A long sentence with invalid characters but doesnt cause our CPU usage to increase drastically!!!')" | gnomon
Voila!, we can see that the regular expression is executed, and we receive an output instantly; it took about 0.0052 seconds to get a result.
In this tutorial, we learned how to safeguard regular expressions from denial-of-service attacks. We dove deeper to see how regular expression matching works, which enabled us to understand why and how this problem even occurs. We then looked at an example of a regular expression pattern with such a vulnerability and demonstrated ways to block loopholes DoS attackers may exploit.
Install LogRocket via npm or script tag. LogRocket.init()
must be called client-side, not
server-side
$ npm i --save logrocket // Code: import LogRocket from 'logrocket'; LogRocket.init('app/id');
// Add to your HTML: <script src="https://cdn.lr-ingest.com/LogRocket.min.js"></script> <script>window.LogRocket && window.LogRocket.init('app/id');</script>
Hey there, want to help make our blog better?
Join LogRocket’s Content Advisory Board. You’ll help inform the type of content we create and get access to exclusive meetups, social accreditation, and swag.
Sign up nowLearn how to implement one-way and two-way data binding in Vue.js, using v-model and advanced techniques like defineModel for better apps.
Compare Prisma and Drizzle ORMs to learn their differences, strengths, and weaknesses for data access and migrations.
It’s easy for devs to default to JavaScript to fix every problem. Let’s use the RoLP to find simpler alternatives with HTML and CSS.
Learn how to manage memory leaks in Rust, avoid unsafe behavior, and use tools like weak references to ensure efficient programs.
One Reply to "How to protect against regex denial-of-service (ReDoS) attacks"
Interesting article.
Your explanation is wrong though. \w+\s* does not return “A long sentence with invalid characters that takes so much time to be matched that it potentially causes our CPU usage to increase”. it matches “A “, because \w is only a single char, so \w+ matches as many word char are available (in this case just the letter A), then \s* matches as many spaces as possible (just one in this case), the result is “A “. then (\w+\s*)* matches the whole string. It matches as many “at least one word char followed by 0 or more space”. The rest of your explanation is therefore erroneous.
Too bad also your solution is not a real solution. It rejects rapidly the sequence with invalid chars, but it also reject any sequence with valid char ! In fact, this formula will never match anything but the empty string. This is due to the fact that you reference the 1st group from within the first group (the \1 is within the first pair of ()). If you define the first group as “The first group is the first group plus the repetition of itself”, the only solution is the empty group.
A solution that works to you problem is “an optional blank separated list of words plus one word” and it’s spelled like this :
/^(\w+\s+)*\w+$/
which can be decoded as :
^: start
(…)* repeat 0 or more time
\w+: at least one word char
\s+: at least one space char :
\w+: followed by at least one word char
$: then end
It instantly matches “correct”
it instantly matches “this is a list of word”
it instantly does not match “this is an invalid list!”
it instantly does not match “A long sentence with invalid characters that takes soo much time to be matched that it potentially causes our CPU usage to increase drastically!!!”