Using markup-aware tools to sanitise user-contributed markup.

I keep hearing about the difficulty of managing user-contributed web contents, and of allowing users to enter markup while avoiding badly broken markup or security issues including cross-site scripting. My usual reaction to this is that the presenter or speaker is making the task unnecessarily difficult by failing to use the right tools. It is like trying to protect against SQL injection attacks without using the simple and reliable solution of prepared statements.

This article briefly outlines some approaches based on markup-aware tools, so that next time I argue the point with someone, I can refer them here.

It is not intended as a full explanation or a tutorial, merely a brief exposition of the underlying idea. Read it, and think about how you could apply similar techniques to your situation.


A common problem for interactive contents, such as wikis and blogs, is how to manage user input. Users commonly want, or need, to include HTML markup in their entries. But a system that permits markup is vulnerable to errors, both accidental and malicious. AJAX solutions can help somewhat with the former, but we still need to deal with the latter to avoid problems like cross-site scripting.

The trivial way to do this is just to escape all markup characters:

  & ==> &amp;
  > ==> &gt;
  < ==> &lt;
  " ==> &quot;

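As a minimal sketch of that escaping step (not taken from any of the modules discussed here), a plain C helper might look like the following. The name html_escape and its snprintf-like return convention are my own assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, replacing the four
 * markup-significant characters with their entity references.
 * Returns the length the fully escaped string needs (excluding the
 * terminating NUL), so the caller can detect truncation.  dst is
 * always NUL-terminated when dstsize > 0, and is never left holding
 * a partial entity. */
size_t html_escape(char *dst, size_t dstsize, const char *src)
{
    size_t needed = 0;    /* bytes the complete result requires */
    size_t written = 0;   /* bytes actually stored in dst */
    int full = (dst == NULL || dstsize == 0);

    assert(src != NULL);
    for (; *src != '\0'; ++src) {
        char one[2] = { *src, '\0' };
        const char *rep = one;   /* default: the character itself */
        size_t len;

        switch (*src) {
        case '&': rep = "&amp;";  break;
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '"': rep = "&quot;"; break;
        default:  break;
        }
        len = strlen(rep);
        if (!full && written + len < dstsize) {
            memcpy(dst + written, rep, len);
            written += len;
        } else {
            full = 1;   /* stop writing, but keep counting */
        }
        needed += len;
    }
    if (dst != NULL && dstsize > 0)
        dst[written] = '\0';
    return needed;
}
```

A caller would pass a buffer and its size, then compare the return value with the buffer size to see whether the whole input fitted. Note that this neutralises all markup, which is exactly why it is no use when users are meant to contribute real HTML.
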
Some software combines this approach with a pseudo-markup that is converted to basic HTML: headings, paragraphs, emphasis, links, even tables. This can be adequate for many applications, though it is sometimes frustrating for users who know HTML and don't want another layer of indirection.

For software that accepts actual markup from clients, we need a more focussed approach. This is the classic problem of untrusted input, and calls for a classic taint-checking solution. The right tool for such checking is a markup-aware parser. libxml2 is ideal for this, offering support for XML, HTML and tag-soup, and the capability to validate.

There are two basic approaches:

  1. Ensure only clean, safe markup gets stored on the server.
  2. Clean up the markup as we serve it.

Clearly (1) is the best solution where feasible, while (2) is a useful fallback for cases where we don't adequately control the contents.

Filtering for Security

If we're filtering on the fly as we deliver a page, then processing efficiency is an important consideration. Thus we should clearly prefer a streaming filter over mod_security. A SAX-based filter such as mod_publisher or mod_proxy_html combines markup awareness with high performance.

With mod_publisher, as with mod_security, we can filter on any part of a page. But in fact it's only the markup we're concerned with, so a markup-aware filter enables us to focus on what matters and ignore what is safe.

Taint checking calls us to be proactive, and define exactly what is acceptable. In mod_publisher, we can do that by specifying a DTD with the MLDTD directive (mod_proxy_html and mod_accessibility have somewhat similar features, but based on libxml2's builtin knowledge of HTML rather than a configurable DTD). With the DTD loaded (at startup: we can't afford to re-parse it for every request!), we can check every element and attribute against it in our SAX handler:

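static el_type elem_type(saxctxt* ctx, const xmlChar* xname) {
  /* [chop] */
  if (dtd) {
    /* look the element up in the DTD that was loaded at startup */
    xmlElementPtr elt = xmlGetDtdElementDesc(dtd, xname) ;
    if (elt)
      /* known element: report whether it is empty (like <br>) or not.
       * The two names after the "?" are assumed: that line is missing
       * from this listing. */
      return (elt->etype == XML_ELEMENT_TYPE_EMPTY)
                ? ELEM_EMPTY : ELEM_NORMAL ;
    else
      return ELEM_BAD ; /* no such element - suppress it */
  }
  /* [chop] */
}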

and for each attribute

  if (dtd && (xmlGetDtdAttrDesc(dtd, name, attname) == NULL)) {
    continue ;  /* suppress this attribute */
  }

So we can untaint markup on the fly simply by defining a DTD that excludes <script> and scripting events. This is a blunt-instrument approach: it also loses scripts we want (though mod_publisher can also be used to insert them)! Furthermore, we may want to restrict the DTD further if we want to protect buggy browsers that allegedly interpret other text such as stylesheets as script. So this solution has its limitations.
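
To make the default-deny idea concrete without pulling in libxml2, here is a minimal sketch in which a static table stands in for the DTD lookups. Everything in it (the policy table, element_allowed, attribute_allowed) is hypothetical and not part of mod_publisher. It assumes the parser has already normalised names to lower case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One permitted element and the attributes permitted on it. */
typedef struct {
    const char *name;
    const char *const *attrs;   /* NULL-terminated list */
} allowed_element;

static const char *const no_attrs[]  = { NULL };
static const char *const a_attrs[]   = { "href", "title", NULL };
static const char *const img_attrs[] = { "src", "alt", NULL };

/* Hypothetical policy: anything not listed here is suppressed.
 * <script> and the on* event attributes are excluded simply by
 * never being mentioned. */
static const allowed_element policy[] = {
    { "p",      no_attrs  },
    { "em",     no_attrs  },
    { "strong", no_attrs  },
    { "ul",     no_attrs  },
    { "li",     no_attrs  },
    { "a",      a_attrs   },
    { "img",    img_attrs },
};

static const allowed_element *find_element(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof policy / sizeof policy[0]; ++i) {
        if (strcmp(policy[i].name, name) == 0)
            return &policy[i];
    }
    return NULL;
}

/* Plays the role of the element lookup: nonzero if the element may pass. */
int element_allowed(const char *name)
{
    return find_element(name) != NULL;
}

/* Plays the role of the attribute lookup: nonzero if attr may stay on element. */
int attribute_allowed(const char *element, const char *attr)
{
    const allowed_element *elt = find_element(element);
    const char *const *a;
    assert(attr != NULL);
    if (elt == NULL)
        return 0;
    for (a = elt->attrs; *a != NULL; ++a) {
        if (strcmp(*a, attr) == 0)
            return 1;
    }
    return 0;
}
```

A SAX start-element handler would call element_allowed() once per element and attribute_allowed() once per attribute, dropping whatever fails. The point of using a DTD instead of a table like this is that the policy then lives in a standard, separately maintained document rather than in the code.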

Securing Uploaded Contents

The preferred approach is to clean up contents as they are uploaded. This gets round the limitations of filtering: we can apply strict rules to a defined part of the page (like the body, or something within it) rather than a whole page, and use unlimited scripting outside the user-contributed (untrusted) parts.

Since editing a page is not a performance-critical operation, we can afford the luxury of a DOM to manipulate the contents. The existing page (or template) is parsed to a DOM, and the user-contributed (untrusted) contribution is parsed to a DOM. The latter can be fully validated before inserting it into the former DOM, and rejected with an HTTP error such as 400 (Bad Request) or 422 (Unprocessable Entity) if it's not acceptable. This catches attacks and accidental errors alike, and for the latter, the system should tell the user exactly what's wrong so they can fix it and resubmit.

This is the approach taken by mod_annot, which allows users to edit sections of a page: in HTML, the contents of a <div>. It uses a DTD that defines a subset of XHTML 1.0 designed to be appropriate for the purpose, as well as just safe (the DTD used by ApacheTutor can be viewed through the Help page).

The most relevant code from mod_annot is

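    /* Wrap the submitted text in an XML declaration and, if a DTD is
     * configured, a DOCTYPE that points at it. */
    if (cfg->dtd) {
      buf= apr_psprintf(r->pool, "<?xml version=\"1.0\"?>\n"
                                 "<!DOCTYPE a:content SYSTEM \"%s\">\n"
                                 "%s",
                    dircfg->dtd, apr_table_get(args, "text") ) ;
    } else {
      buf= apr_psprintf(r->pool, "<?xml version=\"1.0\"?>\n"
                                 "%s",
                        apr_table_get(args, "text") ) ;
      options = 0 ;
    }

    /* Parse the untrusted contribution into its own DOM */
    newdoc = xmlReadMemory(buf, strlen(buf), NULL, NULL, options) ;
    if ( ! newdoc ) {
      ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, "Error parsing input") ;
      return HTTP_BAD_REQUEST ;  /* status assumed: line missing from listing */
    }

    /* Validate against the DTD before it goes anywhere near the page */
    if (cfg->dtd) {
      int ret = 0 ;
      xmlValidCtxtPtr vctx = xmlNewValidCtxt() ;

      if ( ! xmlValidateDocument(vctx, newdoc) ) {
        ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, "Error validating input") ;
        ret = HTTP_BAD_REQUEST ;  /* status assumed: line missing from listing */
      }
      xmlFreeValidCtxt(vctx) ;
      if ( ret != 0 ) {
        xmlFreeDoc(newdoc) ;
        return ret ;
      }
    }
    /* now insert newdoc, together with metadata (author, date, version) */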

(This code could almost certainly be implemented more efficiently, but at the time it was written there was no API to validate a fragment against a DTD in memory.)