An Architecture for Smart Filtering in Apache

The filter chain is the single most significant innovation making Apache 2 a versatile and powerful platform for web applications. But without a content-sensitive means of dynamic configuration, it remains difficult to use with content that is not known in advance, as in a proxy.

In this article, we develop an architecture for dynamic configuration. We present an implementation that is entirely compatible with existing filter modules and applications, but offers additional configuration options for dynamic use.

The proposal described in this article is an abstraction of an existing smart filtering proxy developed by the author. It remains work in progress and subject to refinement.


Content Filtering in Apache


The strength of the Apache Filter chain is the ability to process a data stream independently of the generation of the stream, so we can apply identical processing to, for example, a static file, a CGI script, or a proxy handler.

Some examples of content filter modules for Apache include

  • mod_include, Apache's implementation of Server Side Includes.
  • mod_transform transforms XML markup using an XSLT stylesheet.
  • mod_proxy_html rewrites HTML links into a proxy's address space.
  • mod_charset_lite changes the character encoding of text files served.
  • mod_deflate compresses or uncompresses files according to browser preferences.

Filtering in a Dynamic Context


As it stands, the filter architecture presents problems when used with unknown content, either in a proxy or with a local handler that generates different content types to order. The basic difficulty lies in the Apache configuration. Content filters need to be applied conditionally: for example, we don't want to pass images through an HTML filter. The generic configuration directives for filters are:

SetOutputFilter
Unconditionally insert a filter.
AddOutputFilter, RemoveOutputFilter
Insert or remove a filter based on "extension".
AddOutputFilterByType
Insert a filter based on Content Type.

In the case of a proxy, extensions are meaningless, as we cannot know what conventions an origin server might adopt. Neither is AddOutputFilterByType (nor its hypothetical siblings such as AddOutputFilterByEncoding or AddOutputFilterByLanguage) any use, because the response headers from the proxy are unknown at the time the filter is inserted and initialised. So we have to resort to the unsatisfactory hack of inserting a filter unconditionally, checking the response headers from the proxy, and then having the filter remove itself where appropriate. Examples of filters that will do this are mod_deflate, mod_xmlns, mod_accessibility and mod_proxy_html.
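The self-removing pattern described above can be sketched in a few lines. This is a simplified model rather than real module code: mock_request, mock_filter and html_only_filter are hypothetical stand-ins for request_rec, ap_filter_t and a filter callback, and clearing in_chain stands in for a call to ap_remove_output_filter.

```c
#include <string.h>

/* Hypothetical stand-ins for the httpd structures; not the real Apache API. */
typedef struct {
    const char *content_type;  /* response header, known only once data flows */
} mock_request;

typedef struct {
    mock_request *r;
    int in_chain;              /* 1 while the filter is still inserted */
    int brigades_processed;    /* how many times the filter did real work */
} mock_filter;

/* Returns 1 if the filter processed the data, 0 if it passed it through. */
int html_only_filter(mock_filter *f)
{
    const char *ct = f->r->content_type;

    if (!f->in_chain)
        return 0;              /* already removed: nothing to do */

    /* First sight of the headers: bow out unless the response is HTML. */
    if (ct == NULL || strncmp(ct, "text/html", 9) != 0) {
        f->in_chain = 0;       /* models ap_remove_output_filter(f) */
        return 0;              /* models passing the brigade on untouched */
    }

    f->brigades_processed++;
    return 1;
}
```

Every filter that follows this pattern has to repeat the same header check, which is the duplication the rest of this article sets out to remove.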


Pre- and Post-Processing


As with an origin server, it may be necessary to preprocess data before the main content-transforming filter, and/or postprocess afterwards. For example, when dealing with gzipped content we need to uncompress it for processing and re-compress the processed data. Similarly in an image-processing filter, we need to decode the original image format and re-encode the processed data.

This may involve more than one phase. For example, when filtering text, we may need both to uncompress gzipped data and to transcode the character set before the main filter.

So, potentially we have a large multiplicity of filters: transformation filters, together with pre- and post-processing for different content types and encodings. To repeat the hack of having each filter inserted and determining whether to run or remove itself in such a setup goes beyond simple inelegance and into the absurd. An alternative architecture is required.


An Architecture for Smart Filtering


The key to a generic smart filter is that it must operate late in the processing chain, so that the response headers are available for context-sensitive dispatching. We must therefore drop the filter_init handler from the structure, and defer some configuration-based decisions.

Dropping the filter_init is of course trivial: filters that use it can preserve the function, and simply call it from the main filter callback when f->ctx is unset.
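As a rough illustration of that change, the sketch below moves the old init function into the first call of the main callback. The types mock_filter and my_ctx and the functions my_filter_init and my_filter are hypothetical; a real module would work on ap_filter_t and would normally allocate its context from the request pool (for example with apr_pcalloc) rather than with calloc.

```c
#include <stdlib.h>

/* Hypothetical stand-in for ap_filter_t: only the context pointer matters here. */
typedef struct {
    void *ctx;                 /* NULL until the filter first runs */
} mock_filter;

/* Hypothetical per-filter state. */
typedef struct {
    int initialised;
    int calls;
} my_ctx;

/* What used to be registered as the filter_init function. */
static int my_filter_init(mock_filter *f)
{
    my_ctx *ctx = calloc(1, sizeof(*ctx));
    if (ctx == NULL)
        return -1;
    ctx->initialised = 1;
    f->ctx = ctx;
    return 0;
}

/* Main callback: runs the init step lazily, on the first call only. */
int my_filter(mock_filter *f)
{
    my_ctx *ctx;

    if (f->ctx == NULL && my_filter_init(f) != 0)
        return -1;

    ctx = f->ctx;
    ctx->calls++;              /* real content processing would go here */
    return 0;
}
```

Because initialisation now happens when the first data arrives, it can see the response headers, which a filter_init function run at insertion time cannot.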

To deal with late dispatching, we propose an updated architecture, to be implemented as a generic mod_filter. The purpose of mod_filter is to handle smart configuration on behalf of filter modules in general.


Generic Filter Module


To enable smart, context-sensitive filtering, we propose a revised architecture, to be implemented in a new module mod_filter:

  • A generic filter harness
  • Filter modules as providers
  • Dispatch based on headers

The Filter Harness is a standard filter module callback that can be inserted into the output filter chain. Its purpose is to select a content filter conditionally based on response headers, and dispatch to it. If no content filter is selected, the harness will simply uninsert itself. Note that this (slightly unusual) usage requires us to have set up the filter context in advance and supplied it to ap_add_output_filter.


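#include "httpd.h"
#include "util_filter.h"

/* Harness context: the declared filter name, the content filter chosen
 * for this response, and that content filter's own context.
 */
typedef struct {
  const char* name ;
  apr_status_t (*func)(ap_filter_t* f, apr_bucket_brigade* bb) ;
  void* fctx ;
} harness_ctx ;

/* Selects a content filter from the response headers;
 * implemented elsewhere in mod_filter.
 */
static ap_out_filter_func lookup_handler(request_rec* r, const char* name) ;

static apr_status_t filter_harness(ap_filter_t* f, apr_bucket_brigade* bb) {

  apr_status_t ret ;
  harness_ctx* ctx = f->ctx ;

  /* look up a handler function if we haven't already set it */
  if ( ! ctx->func ) {
    ctx->func = lookup_handler(f->r, ctx->name) ;
    if ( ! ctx->func ) {
      /* nothing matches this response: leave the chain, pass the data on */
      ap_remove_output_filter(f) ;
      return ap_pass_brigade(f->next, bb) ;
    }
  }

  /* call the content filter with its own context, then restore our context */
  f->ctx = ctx->fctx ;
  ret = ctx->func(f, bb) ;
  ctx->fctx = f->ctx ;
  f->ctx = ctx ;
  return ret ;
}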

We can see from the above that an Apache 2.0 output filter can slot comfortably into the proposed architecture: lookup_handler simply returns the filter's main callback. What remains is to update the configuration and filter hooks. mod_filter will implement these changes.


Configuration


Since we are proposing a new filter, we can and should take the opportunity to rationalise all aspects of filter configuration. The new mod_filter will implement

FilterDeclare filter-name level
The handler for this will call ap_register_output_filter to register filter_harness under the name supplied and at the level supplied.
FilterDispatcher filter-name header-name match-criteria
This specifies an HTTP header and a pattern-matching criterion to be used on its value to extract a key on which we dispatch.
FilterProvider filter-name match-value handler
This specifies a match on the criterion supplied by FilterDispatcher, and a filter handler (as provided by any Apache 2.0 filter module) to be dispatched to when the match is successful.

The existing directives (SetOutputFilter and family) can then be used to insert or remove filters in the output chain as before. Likewise, ap_add_output_filter can be used programmatically to insert a smart filter, just as it can a traditional filter.
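To make the dispatch step concrete, here is one possible shape for the lookup behind these directives, written as standalone C rather than module code. Everything in it is assumed for illustration: the provider_rec table models what FilterProvider would record, the header value is taken as already extracted from the header named by FilterDispatcher, and the match criterion is a plain prefix comparison. The filter name "fixup" and the two handlers are invented examples.

```c
#include <stddef.h>
#include <string.h>

/* Simplified callback type; a real filter takes ap_filter_t* and a brigade. */
typedef int (*filter_func)(void *f, void *bb);

/* One row per (hypothetical) FilterProvider directive. */
typedef struct {
    const char *filter_name;   /* name given to FilterDeclare */
    const char *match_value;   /* value to match against the dispatch header */
    filter_func func;          /* content filter to dispatch to */
} provider_rec;

/* Dummy content filters, present only so the table has something to point at. */
int html_handler(void *f, void *bb) { (void)f; (void)bb; return 1; }
int xml_handler(void *f, void *bb)  { (void)f; (void)bb; return 2; }

const provider_rec sample_providers[] = {
    { "fixup", "text/html",             html_handler },
    { "fixup", "application/xhtml+xml", xml_handler  },
};
const size_t n_sample_providers =
    sizeof(sample_providers) / sizeof(sample_providers[0]);

/* Return the first provider registered for filter_name whose match_value
 * is a prefix of header_value, or NULL when nothing matches. */
filter_func lookup_provider(const provider_rec *providers, size_t n,
                            const char *filter_name, const char *header_value)
{
    size_t i;

    if (header_value == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        const provider_rec *p = &providers[i];
        if (strcmp(p->filter_name, filter_name) == 0 &&
            strncmp(header_value, p->match_value, strlen(p->match_value)) == 0)
            return p->func;
    }
    return NULL;
}
```

A NULL result corresponds to the case where the harness removes itself from the chain. Richer match criteria, such as regular expressions, would replace only the comparison inside the loop.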

[ Question: do we want to abstract FilterDispatcher further to enable RewriteRule/RewriteCond-like definition of dispatch criteria ? ]


Protocol Handling


The business of a content filter is to deal with content. As far as possible, it should not be concerned with details of the HTTP protocol. Yet the presence of a filter may affect the response headers.

At this point, we need to make a distinction. Some modules are closely tied to one or more HTTP headers: for example, mod_deflate affects Content-Encoding, or an XSLT transform affects Content-Type. Such cases are clearly the responsibility of the filter itself, and the current level of abstraction (apr_table) is appropriate.

But in some common cases, headers are affected as a by-product of a filter. Examples are Content-Length, Content-Range, ETag, and Warning. It is wasteful and error-prone for every filter to have to deal with these headers. Instead of expecting that, our generic filter harness module should abstract the protocol and deal with these headers on behalf of modules. We can identify three output filter cases:

  • No change to the content.
  • Byte values change but length is preserved. Headers such as Content-MD5 are invalidated, but others like Content-Length are preserved.
  • Arbitrary content transformation. Several headers are affected. For HTTP/1.1 requests, we should ensure the response is chunked (unless force-response-1.0 is in effect).

A second distinction concerns whether cacheability is affected by a filter. There may be more than one way to deal with this, as demonstrated by the different uses of XBitHack with SSI. We should abstract the caching behaviour affecting Last-Modified and Cache-Control headers.

Request headers may also be affected.


Implementation


Filters currently register using
ap_register_output_filter(name, filter_func, filter_init, ftype)
and are inserted using
ap_add_output_filter(name, ctx, req, conn)

To update this to run in a smart filter context, we need to change this:

  • The filter should register itself with the harness and insertion criterion.
  • The filter should declare its effect on the HTTP protocol.

To handle this, we propose a new filter API
ap_register_smart_filter(name, match, filter_func, ctx, protocol_flags)
Now when the harness name is inserted in the filter chain and the response matches match, lookup_handler (referenced above) will return our filter_func for the filter name.

protocol_flags is an OR of flags such as

  • AP_FILTER_NO_TRANSFORM (filter passes content through unchanged - required when Cache-Control: no-transform applies)
  • AP_FILTER_PRESERVE_LENGTH (filter is a 1-1 mapping of bytes)
  • AP_FILTER_NO_BYTERANGES (filter won't work on byteranges)
  • AP_FILTER_NO_PROXY (don't use in a proxy)
  • AP_FILTER_NO_CACHE (don't let the document be cacheable downstream - probably wants refining)
  • AP_FILTER_NO_LOCAL_CACHE (don't let it be cacheable here)
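One way the harness might turn these flags into header handling is sketched below. The flag names are those proposed above, but their numeric values, the header_plan structure and the plan_headers function are assumptions made for this example. The treatment of ETag (dropped whenever the bytes change) is likewise one possible reading of the three output filter cases, not a settled design.

```c
/* Flag names come from the proposal; the numeric values are assumed. */
#define AP_FILTER_NO_TRANSFORM     0x01
#define AP_FILTER_PRESERVE_LENGTH  0x02
#define AP_FILTER_NO_BYTERANGES    0x04
#define AP_FILTER_NO_PROXY         0x08
#define AP_FILTER_NO_CACHE         0x10
#define AP_FILTER_NO_LOCAL_CACHE   0x20

/* Hypothetical summary of what the harness should do to response headers. */
typedef struct {
    int drop_content_length;
    int drop_content_md5;
    int drop_etag;
    int drop_ranges;           /* stop serving byte ranges through the filter */
} header_plan;

header_plan plan_headers(unsigned int flags)
{
    header_plan p = { 0, 0, 0, 0 };

    if (flags & AP_FILTER_NO_BYTERANGES)
        p.drop_ranges = 1;

    /* Case 1: content passes through unchanged, so entity headers survive. */
    if (flags & AP_FILTER_NO_TRANSFORM)
        return p;

    /* Cases 2 and 3: the bytes change, so digests and validators are stale. */
    p.drop_content_md5 = 1;
    p.drop_etag = 1;

    /* Case 3 only: the length may change as well. */
    if (!(flags & AP_FILTER_PRESERVE_LENGTH))
        p.drop_content_length = 1;

    return p;
}
```

Keeping this decision in the harness means a content filter only declares its flags once at registration and never touches these headers itself.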