A common problem for interactive content, such as wikis and blogs, is how to manage user input. Users commonly want, or need, to include HTML markup in their entries. But a system that permits markup is vulnerable to errors, both accidental and malicious. AJAX solutions can help somewhat with the former, but we still need to deal with the latter to avoid problems like cross-site scripting.
The trivial way to do this is just to escape all markup characters:
& ==> &amp;
> ==> &gt;
< ==> &lt;
" ==> &quot;
Some software combines this approach with a pseudo-markup that converts to basic HTML: headings, paragraphs, emphasis, links, even tables. That can be adequate for many applications, though it is sometimes frustrating for users who know HTML and don't want more indirection.
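As an illustration of the escape-everything approach, here is a minimal sketch in Python. The function name escape_markup is hypothetical, and Python's standard html.escape is used only as a convenient stand-in; it covers the four characters listed above and, with quote=True, also escapes the single quote.

```python
import html


def escape_markup(text: str) -> str:
    """Escape every markup-significant character in untrusted text.

    html.escape replaces & first, then < and >; with quote=True it also
    replaces the double quote and the single quote.
    """
    return html.escape(text, quote=True)


if __name__ == "__main__":
    # The result is safe to embed in a page, but no markup survives.
    print(escape_markup('<b>"bold" & brash</b>'))
```

Nothing the user typed is interpreted as markup after this step, which is why the approach is safe but also why users lose the ability to write HTML at all.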
For software that accepts actual markup from clients, we need a more focussed approach. This is the classic problem of untrusted input, and calls for a classic taint-checking solution. The right tool for such checking is a markup-aware parser. libxml2 is ideal for this, offering support for XML, HTML and tag-soup, and the capability to validate.
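To make the idea of parser-based checking concrete, here is a rough sketch of a whitelist filter. It is an assumption-laden illustration, not the libxml2 approach itself: it uses Python's standard html.parser as a stand-in for a markup-aware parser, it does not validate against a DTD, and the names ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_SCHEMES, Sanitizer and sanitize are all hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; a real site would choose its own.
ALLOWED_TAGS = {"p", "em", "strong", "a", "ul", "ol", "li", "h2", "h3"}
ALLOWED_ATTRS = {"a": {"href"}}
# Only absolute URLs with these schemes are kept (relative links are dropped).
SAFE_SCHEMES = ("http://", "https://", "mailto:")
# Elements whose text content is discarded entirely.
DROP_CONTENT = {"script", "style"}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of cleaned output
        self.open_tags = []  # whitelisted tags currently open
        self.skip_depth = 0  # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops onclick, style, and anything unlisted
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unlisted schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray close tag: ignore
        # Close anything left open inside this element, then the element.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup: str) -> str:
    """Return markup reduced to whitelisted, properly closed elements."""
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:  # close whatever the user left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

The design choice here is to rebuild the output from parser events rather than to search the input for dangerous strings: anything not explicitly recognised never reaches the output, and unclosed elements are closed so that one user's entry cannot break the rest of the page.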
There are two basic approaches:
1. Ensure only clean, safe markup gets stored on the server.
2. Clean up the markup as we serve it.
Clearly (1) is the best solution where feasible, while (2) is a useful fallback for cases where we don't adequately control the contents.