mod_annot editor

Annotate Section

Motivation

So, why do we need buckets and brigades? Can't we just keep it simple and pass simple blocks of data? Maybe a void* with a length, or a C++ string?

Well, the first part of the answer we've seen already: buckets are more than just data: they are an abstraction that unifies fundamentally different types of data. But even so, how do they justify the additional complexity over simple buffers and ad-hoc use of other data sources in exceptional cases?

The second motivation for buckets and brigades is that they enable efficient manipulation of blocks of memory, typical of many filtering applications. We will demonstrate a simple but typical example of this: a filter to display plain text documents prettified as an HTML page, with header and footer supplied by the webmaster in the manner of a fancy directory listing.

Now, HTML can of course include blocks of plain text, enclosing them in <pre> to preserve spacing and formatting. So the main task of a text->html filter is to pass the text straight through. But certain special characters need to be escaped. To be safe both with the HTML spec and browsers, we will escape the four characters <, >, &, and " as &lt, etc.

Because the replacement &lt; is longer by three bytes than the original, we cannot just replace the character. If we are using a simple buffer, we either have to extend it with realloc() or equivalent, or copy the whole thing interpolating the replacement. Repeat this a few times and it rapidly gets very inefficient. A better solution is a two-pass scan of the buffer: the first pass simply computes the length of the new buffer, after which we allocate the memory and copy the data with the required replacements. But even that is by no means efficient.

By using buckets and brigades in place of a simple buffer, we can simply replace the characters in situ, without allocating or copying any big blocks of memory. Provided the number of characters replaced is small in comparison to the total document size, this is much more efficient. In outline:

  1. We encounter a character that needs replacing in the bucket
  2. We split the bucket before and after the character. Now we have three buckets: the character itself, and all data before and after it.
  3. We drop the character, leaving the before and after buckets.
  4. We create a new bucket containing the replacement, and insert it where the character was.

Now instead of moving/copying big blocks of data, we are just manipulating pointers into an existing block. The only actual data to change are the single character removed and the few bytes that replace it.