Dealing with multimedia content

We just set up a proxy to parse and where necessary correct HTML. But of course, the web isn't just HTML. Surely feeding non-HTML content through an HTML parser is at best inefficient, if not totally broken?

Yes indeed. mod_proxy_html deals with that by checking the Content-Type header, and removing itself from the processing chain when a document is not HTML (text/html) or XHTML (application/xhtml+xml). This happens in the filter initialisation phase, before any data are processed by the filter.

But that still leaves a problem. Consider compressed HTML:


        Content-Type: text/html
        Content-Encoding: gzip

Feeding that into an HTML parser is clearly broken!

There are two solutions to this. One is to uncompress the incoming data with mod_deflate, run it through the HTML filter, and compress it again on the way out. Supporting compression substantially reduces network traffic, but it increases the processor load on the proxy, which must inflate and deflate every response. It is worthwhile if and only if bandwidth between the proxy and the backend is at a premium: this is common on the 'net at large, but unlikely to be the case on a company internal network.


SetOutputFilter	INFLATE;proxy-html;DEFLATE

The alternative solution is to refuse to support compression. Stripping any Accept-Encoding request header does the job: the backend then never sees a request for compressed content, so it responds uncompressed. Using mod_headers, we add the directive


RequestHeader unset Accept-Encoding

This should only apply to proxied requests, so we put it inside our <Location> containers.
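
As a minimal sketch of what such a container might look like, assuming a proxied application mapped at the hypothetical path /app1/ and assuming mod_headers and mod_proxy_html are both loaded:

        # /app1/ is a placeholder path, not taken from the text above
        <Location /app1/>
                # Rewrite HTML in responses from the backend
                SetOutputFilter proxy-html
                # Never ask the backend for compressed content
                RequestHeader unset Accept-Encoding
        </Location>

Requests outside the container keep their Accept-Encoding header, so content served locally can still be compressed as usual.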

A similar situation arises in the case of encrypted (https) content. But in this case, there is no such workaround: if we could decrypt the data to process it then so could any other man-in-the-middle, and the security would be worthless. This can only be circumvented by installing mod_ssl and a certificate on the proxy, so that the actual secure session is between the browser and the proxy, not the origin server.
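
A rough sketch of that arrangement follows. The hostnames, certificate paths and /app1/ mapping are all placeholders, and the sketch assumes the backend is reached over plain HTTP on a trusted internal network:

        # All names and paths below are hypothetical
        <VirtualHost *:443>
                ServerName www.example.com

                # The secure session terminates here, on the proxy
                SSLEngine on
                SSLCertificateFile /path/to/proxy.crt
                SSLCertificateKeyFile /path/to/proxy.key

                # The proxy talks to the backend unencrypted
                ProxyPass /app1/ http://internal1.example.com/
                ProxyPassReverse /app1/ http://internal1.example.com/

                <Location /app1/>
                        SetOutputFilter proxy-html
                        RequestHeader unset Accept-Encoding
                </Location>
        </VirtualHost>

Because the proxy holds the certificate and sees the decrypted content, it can filter it like any other HTML. The price is that the browser's security guarantee now extends only as far as the proxy.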