All posts by dnet

Not so unique snowflakes

When faced with the problem of identifying entities, most people reach for incremental IDs. Since this requires a central actor to avoid duplicates and can be easily guessed, many solutions depend on UUIDs or GUIDs (universally / globally unique identifiers). However, although being unique solves the first problem, it doesn’t necessarily cover the second. We’ll present our new solution for detecting such issues in web projects in the form of an extension for Burp Suite Pro below.

Continue reading Not so unique snowflakes

Beyond detection: exploiting blind SQL injections with Burp Collaborator

It’s been a steady trend that most of our pentest projects revolve around web applications and/or involve database backends. The former part is usually made much easier by Burp Suite, which has a built-in scanner capable of identifying (among others) injections regarding latter. However, detection is only half of the work needed to be done; a good pentester will use a SQL injection or similar database-related security hole to widen the coverage of the test (obviously within the project scope). Burp continually improves its scanning engine but provides no means to this further exploitation of these vulnerabilities, so in addition to manual testing, most pentesters use standalone tools. With the new features available since Burp Suite 1.7.09, we’ve found a way to combine the unique talents of Burp with our database exploitation framework, resulting in pretty interesting functionality.

Continue reading Beyond detection: exploiting blind SQL injections with Burp Collaborator

Accessing local variables in ProGuarded Android apps

Debugging applications without access to the source code always has its problems, especially with debuggers that were built with developers in mind, who obviously don’t have this restriction. In one of our Android app security projects, we had to attach a debugger to the app to step through heavily obfuscated code.

Continue reading Accessing local variables in ProGuarded Android apps

Detecting ImageTragick with Burp Suite Pro

After ImageTragick (CVE-2016–3714) was published, we immediately started thinking about detecting it with Burp, which we usually use for web application testing. Although collaborator would be a perfect fit, as image processing can happen out-of-band, there’s no official way to tap into that functionality from an extension.

The next best thing is timing, where we try to detect remote code execution by injecting the sleep command which delays execution for a specified amount of seconds. By measuring the time it takes to serve a response without and the with the injected content, the difference tells us whether the code actually got executed by the server.

We used rce1.jpg from the ImageTragick PoC collection and modified it to fit our needs. By calling System.nanoTime() before and after the requests and subtracting the values, the time it took for the server to respond could be measured precisely.

Since we already had a Burp extension for image-related issues, this was modified to include an active scan option that detects ImageTragick. The JPEG/PNG/GIF detection part was reused so that it could detect if any parameters contain images, and if so, it replaces each (one at a time) with the modified rce1.jpg payload. The code was released as v0.3 and can be downloaded either in source format (under MIT license) or a compiled JAR for easier usage. Below is an example of a successful detection:

issue screenshot

Header image © Tomas Castelazo, www.tomascastelazo.com / Wikimedia Commons / CC-BY-SA-3.0

iOS HTTP cache analysis for abusing APIs and forensics

We’ve tested a number of iOS apps in the last few years, and got to the conclusion that most developers follow the recommendation to use APIs already in the system – instead of reinventing the wheel or unnecessarily depending on third party libraries. This affects HTTP backend APIs as well, and quite a few apps use the built-in NSURLRequest class to handle HTTP requests.

However, this results in a disk cache being created, with a similar structure to the one Safari uses. And if the server doesn’t set the appropriate Cache-Control headers this can result in sensitive information being stored in a plaintext database.

Like others in the field of smartphone app security testing, we’ve also discovered such databases within the sandbox and included it in the report as an issue. However, it can also be helpful for further analysis involving the API and for forensic purposes. Still, there were no ready to use tools, which is problematic in such a convoluted format.

The cache can usually be found in Library/Caches/[id]/Cache.db where [id] is application-specific, and is a standard SQLite 3 database, as it can be seen below.

$ sqlite3 Cache.db
SQLite version 3.12.1 2016-04-08 15:09:49
Enter ".help" for usage hints.
sqlite> .tables
cfurl_cache_blob_data       cfurl_cache_response
cfurl_cache_receiver_data   cfurl_cache_schema_version

Within these tables, all the information can be found that can be used to reconstruct the requests issued by the app along with the responses. (Well, almost; in practice, the lack of HTTP version and status text is not a big problem.)

Since we use Burp Suite for HTTP-related projects (web applications and SOAP/REST APIs), an obvious solution was to develop a Burp plugin that could read such a database and present the requests and responses within Burp for analysis and using it in other modules such as Repeater, Intruder or Scanner.

As the database is an SQLite one, the quest began with choosing a JDBC driver that supports it; SQLiteJDBC seemed to be a good choice, however it uses precompiled binaries for some platforms, which limits its compatibility. After the first few tests it also became apparent that quite a few parts of JDBC is not implemented, including the handling of BLOBs (raw byte arrays, optimal choice for storing complex structures not designed for direct human consumption). The quick workaround was to use HEX(foo) which results in a hexadecimal string of the blob foo, and then parsing it in the client.

BLOBs were used for almost all purposes; request and response bodies were stored verbatim (although without HTTP Content Encoding applied, see later), while request and response metadata like headers and the HTTP verb used were serialized into binary property lists, a format common on Apple systems. For the latter, we needed to find a parser, which was made harder by the fact that most solutions (be it code or forum responses) expected the XML-based representation (which is trivial to handle in any language) while in this case the more compact binary form was used. Although there are utilities to convert between these two (plutil, plistutil and others), I didn’t want to add an external command line dependencies and spawn several processes for every request.

Fortunately, I found a project called Quaqua that had a class for parsing the binary format. Although it also tried converting the object tree to the XML format, a bit of modification fixed this as well.

With these in place, I could easily convert the metadata to HTTP headers, and append the appropriate bodies (if present). For UI, I got inspiration from Logger++ but used a much simpler list for enumerating the requests, since I wanted a working prototype first. (Pull requests regarding this are welcome!)

Most of the work was solving small quirks, for example as I mentioned, HTTP Content Encoding (such as gzip) was stripped before saving the body, however the headers referred to the encoded payload, so both the Content-Length and the Content-Encoding headers needed to be removed, and former had to be filled based on the decoded (“unencoded”) body.

Below is a screenshot of the plugin in action, some values had been masked to protect the innocent.

screenshot of the plugin

The source code is available on GitHub under MIT license, with pre-built JAR binaries downloadable from the releases page.

Featured image is Apfelteiler by Frank C. Müller licensed under CC BY-SA

You’re not looking at the big picture

When serving image assets, many web developers find it useful to have a feature that scales the image to a size specified in a URL parameter. After all, bandwidth is expensive, latency is killing the mobile web, and letting the frontend guys link to avatar.php?width=64&height=64 pretty straightforward and convenient. However, solutions with those latter two qualities usually have a hard time with security.

As most readers already know and/or figured it out by now, such functionality can not only be used for scaling images down but also making them huge. This usually results in hogging lots of resources on the server, including RAM (pixel buffers), CPU (image transformation algorithms) and sometimes even disk (caching, temporary files). So in most cases, this leads to Denial of Service (DoS), which affects availability but not confidentiality; however, with most issues like this, it can be combined with other techniques to escalate it further.

During our assessments, we’ve found these DoS issues in many applications, including those used in banks and other financial institutions. Even security-minded developers need to think really hard to consider such an innocent feature as something that should be handled with care. On the other hand, detecting the issue manually is not hard, however it’s something that’s easy to miss, especially if the HTTP History submodule of the Burp Suite Proxy is configured to hide image responses as visual clutter.

To solve this, we’ve developed a Burp plugin that can be loaded into Extender, and passively detects if the size of an image reply is included in the request parameters. The source code is available on GitHub under MIT license, with pre-built JAR binaries downloadable from the releases page. It currently recognizes JPEG, PNG and GIF content, and parameters are parsed using Burp’s built-in helpers.

Since dynamically generated web content often has ill-defined Content-Type values, this plugin checks if there’s at least 12 bytes of payload in the response, and if so, the first four bytes are used to decide which parser should be started for one of the three image formats above. As the plugin is only interested in the size of the image, instead of using a full-fledged parser, a simpler (and hopefully faster and more robust) built-in solution is used that tries to be liberal while parsing the image. If the dimensions of the payload could be extracted, the request is analyzed as well to get the parameters. The current version checks if both the width and the height is included in the request, and if so, the following issue is generated.

Burp finding

The plugin in its current form is quite usable, it’s passive behavior means that just by going through a site, all such image rescaling script instances will appear in the list of issues (if the default setting is used, where every request made through the proxy is fed to passive scanning). Future development might include adding an active verification component, but it’s not trivial as this class of vulnerability by design means that a well-built request might grind the whole application to a halt.

We encourage everyone to try the plugin, pull requests are welcome on GitHub!

Testing stateful web application workflows

SANS Institute accepted my GWAPT Gold Paper about testing stateful web application workflows, the paper is now published in the Reading Room.

The paper introduces the problem we’ve been facing more and more while testing complex web applications, and shows two working solutions. Burp Suite is known by most and used by many professionals in this field, so its GUI-based features are presented first. But as Burp is far from a one-size-fits-all perfect solution, an alternative is shown combining mitmproxy and commix – a dynamic duo that can not only detect but also exploit the issues. To make things easier to demonstrate (and possibly replicate and improve by readers), an intentionally vulnerable web application was developed that (unlike the aforementioned complex apps) requires minimal effort to deploy, lowering the bar for developing tools that can be used later in enterprise environment.

The full Gold Paper can be downloaded from the website of SANS Institute:

Testing stateful web application workflows

The accompanying code is available on GitHub.

Proxying nonstandard HTTPS traffic

Depending on the time spent in IT, most professionals have seen an instance of two where developers based their implementations on specific quirks and other non-standard behaviors, a well-known example is greylisting, another oft-used but less-known one is Wi-Fi band steering. In all these cases, the solution works within a range of implementations, which usually covers most client needs. However, just one step outside that range can result in lengthy investigations regarding how such a simple thing like sending an e-mail or joining a Wi-Fi network can go wrong.

During one of our application security assessments, we found an implementation which abused the way HTTP worked. The functionality required server push, but they probably decided to use HTTP anyway to get through corporate firewalls, and the application came from the age before WebSockets and HTTP2. So they came up with the idea of sending headers like the one below, then keeping the connection open, without sending any data.

HTTP/1.0 200 OK
Server: IDealRelay
Date: Wed, 16 Sep 2015 18:03:24 GMT
Content-length: 10000000
Connection: close
Content-type: text/plain

Googling the value of the Server header revealed mostly false positives, however as it turned out, there’s even a patent US 7734791 B2 with the title Asynchronous hypertext messaging that describes this behavior. Sending data to the server happened over separate HTTP channels, which themselves returned worthless responses, and the real response arrived in the first HTTP channel promising to deliver 10 megabytes in its Content-length header.

Some corporate proxies might have handled this well, however the Proxy module of Burp Suite just waited for the full 10 MB to arrive and just hung there, waiting for all eternity – and even if it managed to receive some data, its Scanner module couldn’t have handled the correlation between sending a payload in one request and getting response in another. In order to solve the problem, I decided to throw together a simple HTTPS proxy that doesn’t assume that much about the contents, just accepts CONNECT requests, performs a Man-in-the-Middle (MITM) attack and dumps the plaintext traffic into a file.

Multiple streams needed to be handled asynchronously, so I chose Erlang for its unique properties, and used built-in modules for everything except generating fake certificates for MITM. Since such certificates are cached, I chose to sign them using the command line version of OpenSSL for the sake of simplicity, so a whitelist is applied to the hostname to avoid command injection attacks – even though it’s supposed to be a tool used to debug “well behaving” applications, it never hurts to protect and setting good example. As an output format, PCAP was chosen as it’s simple (the PCAP writer itself is 32 lines of Erlang code) and widely supported by tools such as Wireshark.

The proxy logic fits into 124 lines of code, and waits for new TCP connections. By setting the socket into HTTP mode, the standard library does all the parsing for the CONNECT request. A standard response is sent, and from this point, the mode gets changed back to raw, and both the server and the client gets an SSL handshake from the proxy. Server certificates used for MITM are cached in an ETS (built-in high-performance in-memory database) table, shared between the lightweight thread-like Erlang processes, so certificates are signed only once per hostname, and the private key is the same for the CA and all the certificates.

After the handshake traffic is simply forwarded between the peers, and simultaneously written into the PCAP file. TCP/IP headers are added with the same port/address tuple as the original stream, while sequence/acknowledgement values are adjusted to reflect the plaintext content. This way, Wireshark doesn’t have any problems with reassembling the stream and can even detect and dissect known application protocols. Source code of the SSL proxy is available in our GitHub repository under MIT license, pull requests are welcome.

With the plaintext traffic in our hands, we managed to develop another tool to experiment with the service and as a result, we found several vulnerabilities, including critical ones. So the lesson to learn is that just because tools cannot intercept traffic out of the box, it doesn’t mean that the application is secure – existing tools are great for lots of purposes but one big difference between hackers and script kiddies is the ability of former to develop their own tools.

Quick and dirty Android binary XML edits

Last week I had an Android application that I wanted to test in the Android emulator (the official one included in the SDK). I had the application installed from Play Store on a physical device, and as I’ve done many times, I just grabbed it using Drozer and issued the usual ADB command to install it on the emulator. (The sizes and package names have been altered to protect the innocent.)

$ adb install hu.silentsignal.blogpost.apk
1337 KB/s (27313378 bytes in 22.233s)
        pkg: /data/local/tmp/hu.silentsignal.blogpost.apk
Failure [INSTALL_FAILED_CONTAINER_ERROR]

A quick search on the web revealed that the application most probably had the installLocation parameter set to preferExternal in the manifest file. Latter is an XML file called AndroidManifest.xml that contains important metadata about Android applications and is transformed into a binary representation upon compilation to reduce the size and processing power required on resource-constrained devics. Running android-apktool converted it back to text format, and revealed that it was indeed the cause.

<?xml version="1.0" encoding="utf-8"?>
<manifest android:versionCode="133744042" android:versionName="4.13.37"
  android:installLocation="preferExternal" package="hu.silentsignal.blogpost"
  xmlns:android="http://schemas.android.com/apk/res/android">

Most results of the web search agreed that the emulator (although capable of emulating SD cards) is incompatible with this setting, some suggested increasing the memory of the emulated device, others said the same about the SD card, but unfortunately none of these worked for us. The majority of accepted answers solved the problem by changing the preferExternal parameter to auto, which is the default.

Changing an Android application and repackaging it is also a breeze usually, apktool supports this natively, I just have to sign the resulting APK with a key of my own. However, this application used some features that apktool (and other tools invoked in the process, including AAPT) didn’t like. I’ve met this situation before, and there are usually two solutions.

  1. Removing such features in a way that the application can still be tested is a cumbersome series of iterations, and leads to almost certain insanity.

  2. Updating apktool and some dependencies so that they accept such features is rather painless, as it usually Just Works™.

However in this case, even beta versions couldn’t handle the task, I’ve even written some wrapper scripts around the dependencies to tweak with parameters like minimum and targeted API levels, but got nowhere. Then I realized, binary XML files still have to store the attribute names somewhere, so I fired up a hex editor and hoped that the Android runtime won’t complain about unknown attributes and will use the default value. It seemed that the format uses UTF-16 to store strings, and at offset 0x112, I changed the i to o, resulting in installLocatoon.

00f0  3a080000 82080000  0f006900 6e007300  |:.........i.n.s.|
0100  74006100 6c006c00  4c006f00 63006100  |t.a.l.l.L.o.c.a.|
0110  74006f00 6f006e00  00000b00 76006500  |t.o.o.n.....v.e.|
0120  72007300 69006f00  6e004300 6f006400  |r.s.i.o.n.C.o.d.|
0130  65000000 0b007600  65007200 73006900  |e.....v.e.r.s.i.|

Then, I simply updated the manifest in the APK file using a ZIP tool since APK files are just ZIP archives with specific naming conventions, just like JAR, DOCX and ODT files. Since the digital signature of the APK is now corrupted, I had to resign it with jarsigner, but installation still failed.

$ 7z -tzip a hu.silentsignal.blogpost.apk AndroidManifest.xml

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=hu_HU.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Scanning

Updating archive hu.silentsignal.blogpost.apk

Compressing  AndroidManifest.xml      

Everything is Ok
$ jarsigner -sigalg SHA1withRSA -digestalg SHA1 \
    -keystore s2.jks hu.silentsignal.blogpost.apk s2
Enter Passphrase for keystore: 
$ adb install hu.silentsignal.blogpost.apk
1337 KB/s (27313378 bytes in 22.233s)
        pkg: /data/local/tmp/hu.silentsignal.blogpost.apk
Failure [INSTALL_PARSE_FAILED_NO_CERTIFICATES]

How can there be no certificates? Let’s list the contents with a ZIP tool, and look for the META-INF, an old friend from the Java world (JAR files) that contains a list of files with name and (in case of Android, SHA-1) hash (MANIFEST.MF), a public key (*.RSA in case of RSA), and a signature of the former with the latter (*.SF). The name of the public key and signature files are the same as their alias in the keystore converted to uppercase; see the last parameter of jarsigner in the above commands (s2).

   Date      Time    Attr     Size   Compressed  Name
------------------- ----- -------- ------------  --------------------
2014-04-03 14:44:42 .....   362941       110461  META-INF/MANIFEST.MF
2014-04-03 14:44:42 .....   369105       113933  META-INF/S2.SF
2014-04-03 14:44:42 .....     2046         1865  META-INF/S2.RSA
2014-02-27 13:37:16 .....      928          637  META-INF/CERT.RSA
2014-02-27 13:37:16 .....   361020       113942  META-INF/CERT.SF

See? It’s there! Also, another certificate (CERT.*), since without apktool (which rebuilds APK archives from scratch) everything except AndroidManifest.xml is included from the original file. Let’s delete the META-INF directory, and try again.

$ 7z -tzip d hu.silentsignal.blogpost.apk META-INF

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=hu_HU.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Updating archive hu.silentsignal.blogpost.apk

      
Everything is Ok
$ jarsigner -sigalg SHA1withRSA -digestalg SHA1 \
    -keystore s2.jks hu.silentsignal.blogpost.apk s2
Enter Passphrase for keystore: 
$ adb install hu.silentsignal.blogpost.apk                            
1337 KB/s (27313378 bytes in 22.233s)
        pkg: /data/local/tmp/hu.silentsignal.blogpost.apk
Success

With this last modification, it worked, and I was able to explore the application within the emulator that makes it much more easier to set a global proxy and manipulate certificates than a physical device, and I haven’t even mentioned faking sensors and GSM information, or taking snapshots. The lesson here was that

  • if a difficult format takes more than 5 minutes to recreate, it’s worth considering manual editing in case of simple modifications,
  • most XML readers (including the Android runtime) tend to ignore unknown attributes, and
  • the META-INF directory should be removed before signing an APK file, otherwise the runtime refuses it.

Thanks to Etamme for the featured image Androids, licensed under CC-BY-3.0

Sanitizing input with regex considered harmful

Sanitizing input (as in trying to remove a subset of user input so that the remaining parts become “safe”) is hard to get right in itself. However, many developers doom their protection in the first place by choosing the wrong tool to get it done, in this case, regular expressions (regex for short). While they’re powerful for quite a few purposes, as the old saying goes,

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

During a recent pentest, we found an application that did this by stripping HTML tags from a string by replacing the regular expression <.*?> with an empty string. (Apparently, they haven’t read the best reaction to processing HTML with regexes.) For those wondering about the question mark after the star, it disables the default greedy behavior of the engine, so the expression matches a less-than sign, as few characters as possible of any kind, and a greater-than sign. At first sight, one might think that’s the definition of an HTML tag, and for a minute we also believed it was the case.

In regexes, the dot matches any character. However, the definition of any excludes newlines (ASCII 0x0a, \n) by default in most implementations, while the HTML standard allows for such characters inside tags, which gives us a specific class of tags that are valid in browsers but are not stripped by the above algorithm. Below are some examples of platforms used to implement web applications and their behavior regarding this “challenge”. Some libraries have similar solutions, but only one thing was common in these five languages; by default, the above expression fails the test. For the sake of brevity and readability, examples were produced in interactive shells (REPLs); in case of Java and .NET, Jython and IronPython were used, respectively.

Java

>>> from java.util.regex import Pattern
>>> p = Pattern.compile('<.*?>')
>>> p.matcher('<foobar>').replaceAll('')
u''
>>> p.matcher('<foo\nbar>').replaceAll('')
u'<foo\nbar>'

The official documentation states that dot matches “any character (may or may not match line terminators)”. The link points to a section that says “The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.” [emphasis added] Adding the flag solves the problem, as it can be seen below.

>>> p = Pattern.compile('<.*?>', Pattern.DOTALL)
>>> p.matcher('<foobar>').replaceAll('')
u''
>>> p.matcher('<foo\nbar>').replaceAll('')
u''

Python

>>> import re
>>> re.sub('<.*?>', '', '<foobar>')
''
>>> re.sub('<.*?>', '', '<foo\nbar>')
'<foo\nbar>'

Python follows a similar path, even the flag is called the same: “In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline” [emphasis added].

>>> re.sub('<.*?>', '', '<foobar>', flags=re.DOTALL)
''
>>> re.sub('<.*?>', '', '<foo\nbar>', flags=re.DOTALL)
''

PHP

php > var_dump(preg_replace("/<.*?>/", "", "<foobar>"));
string(0) ""
php > var_dump(preg_replace("/<.*?>/", "", "<foo\nbar>"));
string(9) "<foo
bar>"

Although PHP has an interactive mode (php -a), return values are silently discarded, and var_dump doesn’t escape newlines. However, it clearly illustrates that it behaves just like the others, but PHP doesn’t mention this behavior in the official manual for preg_replace (even though a user comment points it out, it lacks the solution). The PCRE modifiers page has the answer, the s modifier should be used, and it even shows the longer name for it (PCRE_DOTALL), although there’s no way to use it, in contrast with Python’s solution (re.S is equivalent to re.DOTALL).

php > var_dump(preg_replace("/<.*?>/s", "", "<foobar>"));
string(0) ""
php > var_dump(preg_replace("/<.*?>/s", "", "<foo\nbar>"));
string(0) ""

.NET

>>> from System.Text.RegularExpressions import Regex
>>> Regex.Replace('<foobar>', '<.*?>', '')
''
>>> Regex.Replace('<foo\nbar>', '<.*?>', '')
'<foo\nbar>'

Of course, Microsoft surprises noone by having its own solution for the problem. In their documentation on regexes, they also mention that dot “matches any single character except \n”, but you have to figure it out yourself; there’s no link to the Singleline member of RegexOptions.

>>> from System.Text.RegularExpressions import RegexOptions
>>> r = Regex('<.*?>', RegexOptions.Singleline)
>>> r.Replace('<foobar>', '')
''
>>> r.Replace('<foo\nbar>', '')
''

Ruby

irb(main):001:0> "<foobar>".sub!(/<.*?>/, "")
=> ""
irb(main):002:0> "<foo\nbar>".sub!(/<.*?>/, "")
=> nil

Ruby performs as usual, having easy-to-write/hard-to-read shorthands, however, its solution is almost as dumbfounding as the above. Like PHP, it expects modifiers as lowercase characters after the trailing slash (/), but it interprets s as a signal to interpret the regex as SJIS encoding (I never knew it even existed), and wants you to use m (called MULTILINE by the official documentation, adding to the confusion), which is used for other purposes in other regular expression engines.

irb(main):005:0> "<foobar>".sub!(/<.*?>/m, "")
=> ""
irb(main):004:0> "<foo\nbar>".sub!(/<.*?>/m, "")
=> ""

Javascript

> "<foobar>".replace(/<.*?>/, "")
''
> "<foo\nbar>".replace(/<.*?>/, "")
'<foo\nbar>'
> "<foo\nbar>".replace(/<.*?>/m, "")
'<foo\nbar>'

JavaScript has three modifiers (igm), none of them useful for making dot match literally any character. The only solution is to do it explicitly, the best one of these seems to be matching the union of whitespace and non-whitespace characters.

> "<foo\nbar>".replace(/<[\s\S]*?>/, "")
''
> "<foobar>".replace(/<[\s\S]*?>/, "")
''

Conclusion

The above solutions address a single problem only (stripping HTML tags having line breaks), processing untrusted input is much more than this. If you build a web application that must display such content, use a proper library for this purpose, preferably a templating language that performs escaping by default. For other purposes, use a DOM and don’t forget to test for corner cases, including both valid and broken HTML.