Sanitizing input with regex considered harmful
Sanitizing input (that is, trying to remove a subset of user input so that the remainder becomes “safe”) is hard to get right in itself. Many developers, however, doom their protection from the start by choosing the wrong tool for the job: in this case, regular expressions (regex for short). Regexes are powerful for quite a few purposes, but as the old saying goes,
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
During a recent pentest, we found an application that did exactly this: it stripped HTML tags from a string by replacing every match of the regular expression <.*?> with an empty string. (Apparently, the developers hadn’t read the best reaction to processing HTML with regexes.) For those wondering about the question mark after the star, it disables the default greedy behavior of the engine, so the expression matches a less-than sign, then as few characters of any kind as possible, then a greater-than sign. At first sight, one might think that’s the definition of an HTML tag, and for a minute we believed so too.
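The difference between greedy and lazy matching is easiest to see side by side. The following is a minimal Python sketch; the sample string is made up for illustration and does not come from the tested application.

```python
import re

# Illustrative input, not taken from the pentested application.
sample = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, so a single match runs
# from the first '<' to the last '>' and swallows the text too.
greedy = re.sub(r"<.*>", "", sample)

# Lazy: .*? takes as little as it can, so every tag is matched separately.
lazy = re.sub(r"<.*?>", "", sample)

print(repr(greedy))  # ''
print(repr(lazy))    # 'bold and italic'
```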
In regexes, the dot matches any character. However, in most implementations the definition of “any” excludes newlines (ASCII 0x0a, \n) by default, while the HTML standard allows such characters inside tags. This gives us a specific class of tags that are valid in browsers but are not stripped by the above algorithm. Below are some examples of platforms used to implement web applications and their behavior regarding this “challenge”. Some of them offer similar solutions, but only one thing is common to all six: by default, the above expression fails the test. For the sake of brevity and readability, examples were produced in interactive shells (REPLs); in the case of Java and .NET, Jython and IronPython were used, respectively.
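To make the impact concrete, here is a small Python sketch of what such a filter lets through. The helper name strip_tags_naive and the payload are assumptions made for illustration; the application's real code was not published.

```python
import re

def strip_tags_naive(text):
    """Hypothetical helper mirroring the approach described above."""
    return re.sub(r"<.*?>", "", text)

# The same tag, once on a single line and once with a newline inside it.
single_line = "<img src=x onerror=alert(1)>"
multi_line = "<img\nsrc=x onerror=alert(1)>"

print(repr(strip_tags_naive(single_line)))  # ''
print(repr(strip_tags_naive(multi_line)))   # '<img\nsrc=x onerror=alert(1)>'
```

The second payload passes through the filter untouched, since the dot refuses to cross the newline.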
Java
>>> from java.util.regex import Pattern
>>> p = Pattern.compile('<.*?>')
>>> p.matcher('<foobar>').replaceAll('')
u''
>>> p.matcher('<foo\nbar>').replaceAll('')
u'<foo\nbar>'
The official documentation states that dot matches “any character (may or may not match line terminators)”. The link points to a section that says “The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.” [emphasis added] Adding the flag solves the problem, as can be seen below.
>>> p = Pattern.compile('<.*?>', Pattern.DOTALL)
>>> p.matcher('<foobar>').replaceAll('')
u''
>>> p.matcher('<foo\nbar>').replaceAll('')
u''
Python
>>> import re
>>> re.sub('<.*?>', '', '<foobar>')
''
>>> re.sub('<.*?>', '', '<foo\nbar>')
'<foo\nbar>'
Python follows a similar path, and even the flag has the same name: “In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline” [emphasis added].
>>> re.sub('<.*?>', '', '<foobar>', flags=re.DOTALL)
''
>>> re.sub('<.*?>', '', '<foo\nbar>', flags=re.DOTALL)
''
PHP
php > var_dump(preg_replace("/<.*?>/", "", "<foobar>"));
string(0) ""
php > var_dump(preg_replace("/<.*?>/", "", "<foo\nbar>"));
string(9) "<foo
bar>"
Although PHP has an interactive mode (php -a), return values are silently discarded, and var_dump doesn’t escape newlines, hence the output split across two lines. Still, it clearly shows that PHP behaves just like the others. The official manual page for preg_replace doesn’t mention this behavior (a user comment points it out, but lacks the solution). The PCRE modifiers page has the answer: the s modifier should be used. The page even gives the longer name for it (PCRE_DOTALL), although there’s no way to use that name in PHP code, in contrast with Python, where re.S is equivalent to re.DOTALL.
php > var_dump(preg_replace("/<.*?>/s", "", "<foobar>"));
string(0) ""
php > var_dump(preg_replace("/<.*?>/s", "", "<foo\nbar>"));
string(0) ""
.NET
>>> from System.Text.RegularExpressions import Regex
>>> Regex.Replace('<foobar>', '<.*?>', '')
''
>>> Regex.Replace('<foo\nbar>', '<.*?>', '')
'<foo\nbar>'
Of course, Microsoft surprises no one by having its own solution for the problem. Their documentation on regexes also mentions that dot “matches any single character except \n”, but you have to figure out the fix yourself; there’s no link to the Singleline member of RegexOptions.
>>> from System.Text.RegularExpressions import RegexOptions
>>> r = Regex('<.*?>', RegexOptions.Singleline)
>>> r.Replace('<foobar>', '')
''
>>> r.Replace('<foo\nbar>', '')
''
Ruby
irb(main):001:0> "<foobar>".sub!(/<.*?>/, "")
=> ""
irb(main):002:0> "<foo\nbar>".sub!(/<.*?>/, "")
=> nil
Ruby performs as usual, with easy-to-write, hard-to-read shorthands (note that sub! returns nil when nothing was replaced), and its solution is almost as dumbfounding as the above. Like PHP, it expects modifiers as lowercase characters after the trailing slash (/), but it interprets s as a signal to treat the regex as SJIS-encoded (I never knew that even existed), and wants you to use m instead (called MULTILINE by the official documentation, adding to the confusion), a flag that serves a different purpose in other regular expression engines.
irb(main):005:0> "<foobar>".sub!(/<.*?>/m, "")
=> ""
irb(main):004:0> "<foo\nbar>".sub!(/<.*?>/m, "")
=> ""
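For contrast, here is a short Python sketch of what a flag named MULTILINE does in an engine where it is unrelated to the dot. It only uses the standard re module; the variable names are illustrative.

```python
import re

text = "<foo\nbar>"

# MULTILINE only changes where ^ and $ may match; the dot still stops at '\n'.
with_multiline = re.sub(r"<.*?>", "", text, flags=re.MULTILINE)

# DOTALL is the flag that lets the dot match '\n' in Python.
with_dotall = re.sub(r"<.*?>", "", text, flags=re.DOTALL)

# What MULTILINE is actually for: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", "foo\nbar", flags=re.MULTILINE)

print(repr(with_multiline))  # '<foo\nbar>'
print(repr(with_dotall))     # ''
print(line_starts)           # ['foo', 'bar']
```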
JavaScript
> "<foobar>".replace(/<.*?>/, "")
''
> "<foo\nbar>".replace(/<.*?>/, "")
'<foo\nbar>'
> "<foo\nbar>".replace(/<.*?>/m, "")
'<foo\nbar>'
At the time of writing, JavaScript had three modifiers (igm), none of them useful for making dot match literally any character. ECMAScript 2018 has since added an s (dotAll) flag for this purpose; in engines without it, the only solution is to do it explicitly, and the best option seems to be matching the union of whitespace and non-whitespace characters.
> "<foo\nbar>".replace(/<[\s\S]*?>/, "")
''
> "<foobar>".replace(/<[\s\S]*?>/, "")
''
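The explicit character class does not depend on any engine-specific flag, so the same pattern can be carried over to the other platforms. A minimal Python sketch, where the name TAG and the third test string are chosen for illustration:

```python
import re

# [\s\S] means "whitespace or non-whitespace", that is, every character,
# so no DOTALL-style flag is needed.
TAG = re.compile(r"<[\s\S]*?>")

print(repr(TAG.sub("", "<foobar>")))        # ''
print(repr(TAG.sub("", "<foo\nbar>")))      # ''
print(repr(TAG.sub("", "a<b\n>c</b\n>d")))  # 'acd'
```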
Conclusion
The above solutions address a single problem only (stripping HTML tags that contain line breaks); processing untrusted input involves much more than this. If you build a web application that must display such content, use a proper library for the purpose, preferably a templating language that performs escaping by default. For other purposes, use a DOM, and don’t forget to test for corner cases, including both valid and broken HTML.
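As a rough illustration of why escaping is the safer default, the Python sketch below compares the "fixed" regex with html.escape from the standard library. The helper name strip_tags_dotall and the payload are assumptions made for this example; whether a browser would treat such an unterminated fragment as a tag depends on what the surrounding page puts after it.

```python
import html
import re

def strip_tags_dotall(text):
    # Hypothetical helper: the flag-corrected regex from the sections above.
    return re.sub(r"<.*?>", "", text, flags=re.DOTALL)

# There is no closing '>' at all, so even the DOTALL variant finds nothing to remove.
unterminated = "<img src=x onerror=alert(1) "

stripped = strip_tags_dotall(unterminated)
escaped = html.escape(unterminated)

print(repr(stripped))  # '<img src=x onerror=alert(1) '
print(repr(escaped))   # '&lt;img src=x onerror=alert(1) '
```

Escaping does not try to decide what counts as a tag; it turns every less-than sign into &lt;, so nothing in the output can start one.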