So, why do we need buckets and brigades? Can't we just keep it simple
and pass simple blocks of data? Maybe a void* with a length, or a C++
string?
Well, the first part of the answer we've seen already: buckets are
more than just data: they are an abstraction that unifies fundamentally
different types of data. But even so, how do they justify the
additional complexity over simple buffers and ad-hoc use of other
data sources in exceptional cases?
The second motivation for buckets and brigades is that they enable
efficient manipulation of blocks of memory, typical of many filtering
applications. We will demonstrate a simple but typical example of this:
a filter to display plain text documents prettified as an HTML page,
with header and footer supplied by the webmaster in the manner of a
fancy directory listing.
Now, HTML can of course include blocks of plain text, enclosing them
in <pre> to preserve spacing and formatting. So the main task
of a text->html filter is to pass the text straight through.
But certain special characters need to be escaped. To be safe both
with the HTML spec and browsers, we will escape the four characters
<, >, &, and " as <, etc.
Because the replacement < is longer by three bytes than the
original, we cannot just replace the character. If we are using
a simple buffer, we either have to extend it with realloc()
or equivalent, or copy the whole thing interpolating the replacement.
Repeat this a few times and it rapidly gets very inefficient.
A better solution is a two-pass scan of the buffer:
the first pass simply computes the length of the new buffer, after
which we allocate the memory and copy the data with the required
replacements. But even that is by no means efficient.
By using buckets and brigades in place of a simple buffer, we can simply
replace the characters in situ, without allocating or copying any
big blocks of memory. Provided the number of characters replaced is
small in comparison to the total document size, this is much more
efficient. In outline:
- We encounter a character that needs replacing in the bucket
- We split the bucket before and after the character. Now
we have three buckets: the character itself, and all data before
and after it.
- We drop the character, leaving the before and after buckets.
- We create a new bucket containing the replacement, and insert
it where the character was.
Now instead of moving/copying big blocks of data, we are just
manipulating pointers into an existing block. The only actual
data to change are the single character removed and the few
bytes that replace it.