I’ve been asked to write a short series of technical posts about the problem of efficiently implementing a big dictionary (the fundamental data structure underlying file systems, databases and the Acunu Storage Core), and in the process hopefully shed some light on our technology.
In this first post I want to discuss query/update tradeoffs for unversioned dictionaries. That might be a bit of a mouthful at first, but it’s not too complicated. The plan for future posts looks something like this: in part 2 we’ll examine versioned dictionaries and some tradeoffs there. In part 3, we’ll consider the problem of building a cache-oblivious versioned dictionary, and why you’d want such a beast.
Before diving in, let me summarise a bit where Acunu sits in this landscape. We’ve developed the first suite of cache-oblivious (and “cache-aware”) algorithms that support fully-versioned dictionaries with almost-optimal query/update tradeoffs. They also keep these bounds when applied to SSDs, which have a somewhat different access model to traditional disks. Efficient versioning is a very powerful tool, and our schemes are the first that allow dynamic query/update tradeoffs for versioned data, in particular allowing very high update rates for fully versioned metadata.
Most of the algorithms I’ll cover in these posts aren’t in textbooks (yet), although several can be found in research papers. We plan to publish papers with more details and proofs; we’ll post links when they become available.
The Dictionary
An ordered dictionary supports two basic operations:
* update(key, value) associates value with key (overwriting the previous value, if any);
* query(start, end) returns all (k,v) pairs where k is between start and end.
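
Concretely, here’s a toy in-memory model of this interface in Python (the names and representation are mine, purely for illustration, not Acunu’s API); everything that follows is about making these two operations fast when the data doesn’t fit in memory:

```python
import bisect

class ToyOrderedDict:
    """In-memory reference model: pins down the semantics of update/query,
    ignoring all the I/O questions discussed below."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._vals = {}   # key -> value

    def update(self, key, value):
        """Associate value with key, overwriting the previous value if any."""
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def query(self, start, end):
        """Return all (k, v) pairs with start <= k <= end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

d = ToyOrderedDict()
d.update("b", 1); d.update("a", 2); d.update("b", 3)
print(d.query("a", "b"))   # [('a', 2), ('b', 3)]
```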
We’re interested in ‘big’ dictionaries, by which I mean that there are far too many keys to fit into internal memory, so we must resort to using external memory, such as disk. For the past couple of decades, the B-tree has been the data structure of choice, largely because queries are fast — indeed, asymptotically optimal. But what does this mean? In order to talk more precisely about bounds on query and update cost, we need a cost model. The ‘disk access model’ (DAM) assumes we have a disk and some memory, and that data is always read or written in blocks of B elements (in other words, it’s not worth reading or writing any less than this). In practice, a large SATA disk typically has B around 256KB-512KB (but this view of the world is quite oversimplified, as we’ll discuss in part 3!).
A trivial tradeoff
With N elements, a B-tree has height $O(\log_B N)$, so queries cost $O(\log_B N)$ I/Os; since an update must first find the leaf it modifies, updates cost $O(\log_B N)$ I/Os too. At the other extreme sits the humble log: simply append every update to an unsorted file. Updates now cost $O(1/B)$ I/Os amortised (buffer a block’s worth in memory, then write it out in one go), but a query has to scan the entire log, costing $O(N/B)$ I/Os. Fast queries or fast updates: pick one.
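To summarise the two endpoints in the DAM model (all costs in I/Os; the log’s update bound is amortised):

```latex
\begin{aligned}
\text{B-tree:} \quad & \text{query } O(\log_B N), && \text{update } O(\log_B N)\\
\text{log:}    \quad & \text{query } O(N/B),      && \text{update } O(1/B)
\end{aligned}
```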
Having one’s cake and eating it
There is a very interesting tradeoff between query and update performance, and B-trees achieve only one point on this tradeoff.
Data structures that achieve much better update performance without resorting to linear scans on queries are known, for example, log-structured merge (LSM) trees. Over the last decade, the algorithms community has developed a much better understanding of these tradeoffs, many of them with matching upper and lower bounds. In particular, consider the following (folklore?) scheme, based on the ‘logarithmic method’.

Maintain a set of geometrically-growing sorted arrays in ‘levels’; level k contains at most 2 arrays of size at most $2^k$. To insert an element, add it as a new array of size 1 at level 0; whenever a level already holds 2 arrays, merge those two (a purely sequential read and write) into a single array at the next level up, cascading if that level overflows in turn.

After inserting N elements there are at most about $\log_2 N$ levels. Each element is rewritten at most once per level, and since merging sorted arrays is sequential it moves elements at $O(1/B)$ I/Os each, so the amortised update cost is $O\!\left(\frac{\log_2 N}{B}\right)$ I/Os. Queries are slower than a B-tree’s, though: a query must binary-search every array, and searching an array of size $2^k$ costs roughly $k - \log_2 B$ I/Os, for a total of $O\!\left(\log^2 \frac{N}{B}\right)$ I/Os.
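Here is a minimal in-memory sketch of this scheme (my own illustrative code, not Acunu’s; on disk, each array would occupy consecutive blocks and each merge would be a streaming read and write):

```python
import bisect

def merge(older, newer):
    """Merge two sorted (key, value) arrays; newer values win on duplicate keys."""
    out, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            out.append(older[i]); i += 1
        elif older[i][0] > newer[j][0]:
            out.append(newer[j]); j += 1
        else:                       # duplicate key: the newer value shadows the older
            out.append(newer[j]); i += 1; j += 1
    out.extend(older[i:])
    out.extend(newer[j:])
    return out

class LogLevels:
    """Logarithmic method: level k holds at most 2 sorted arrays of size <= 2**k.

    Purely in-memory, to show the structure; on disk each merge is a
    sequential read+write, i.e. O(1/B) I/Os per element moved."""

    def __init__(self):
        self.levels = []   # levels[k] = up to 2 sorted arrays (older first)

    def update(self, key, value):
        carry, k = [(key, value)], 0   # a sorted array of size 1 enters level 0
        while True:
            if len(self.levels) <= k:
                self.levels.append([])
            if len(self.levels[k]) < 2:
                self.levels[k].append(carry)
                return
            # Level k is full: merge its two arrays into one array destined
            # for level k+1, and install `carry` at the now-vacated level k.
            older, newer = self.levels[k]
            self.levels[k] = [carry]
            carry, k = merge(older, newer), k + 1

    def query(self, start, end):
        """Range query: every array must be consulted, newest arrays first."""
        found = {}
        for level in self.levels:         # lower levels hold newer data...
            for arr in reversed(level):   # ...and later arrays within a level are newer
                i = bisect.bisect_left(arr, (start,))
                while i < len(arr) and arr[i][0] <= end:
                    found.setdefault(arr[i][0], arr[i][1])  # newest value wins
                    i += 1
        return sorted(found.items())
```

Note how an update never searches for anything: all the I/O it triggers is sequential merging, which is exactly why the amortised update cost is so low.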
Let’s put some real numbers to this. B is typically 256KB/(100 bytes per element), say 2500, and for, say, a billion elements, $\log_2 N \approx 30$. The scheme above then spends about $30/2500 \approx 0.012$ I/Os per update, i.e. over 80 updates per disk I/O, whereas a B-tree of the same size needs $\log_B N \approx 3$ I/Os per update. That’s a gap of more than two orders of magnitude.
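If you want to play with these numbers, the arithmetic is easily scripted (B and N here are the illustrative figures from the paragraph above):

```python
import math

B = 2500      # elements per block: ~256KB / ~100 bytes per element
N = 10**9     # a billion elements, for illustration

btree_update = math.log(N, B)     # B-tree: O(log_B N) I/Os per update
merge_update = math.log2(N) / B   # log scheme: O(log2(N)/B) I/Os amortised

print(f"B-tree update:     ~{btree_update:.2f} I/Os")
print(f"log-scheme update: ~{merge_update:.4f} I/Os")
print(f"ratio:             ~{btree_update / merge_update:.0f}x")
```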
This scheme is far from perfect (e.g. the avid reader will have noticed that the worst-case update cost is far worse than the average case), but it gives a flavour of how these kinds of data structures work; this sort of data structure is roughly what underpins Google’s BigTable, and its open source relatives including Cassandra, HBase, and so on. More recently, Tokutek’s fractal tree data structure shows how one can improve the lookup performance of this idea to $O(\log N)$ I/Os, using a technique known as fractional cascading, without giving up the cheap updates.
As good as it gets
Tradeoffs like these are ubiquitous in fundamental computer science problems, so you won’t be surprised to hear that a range of possible tradeoffs has been figured out. The following is possible, and known to be asymptotically optimal: for any constant $0 < \epsilon \le 1$, updates cost $O\!\left(\frac{\log_B N}{\epsilon B^{1-\epsilon}}\right)$ amortised I/Os and queries cost $O\!\left(\frac{\log_B N}{\epsilon}\right)$ I/Os. Setting $\epsilon = 1$ recovers exactly the B-tree bounds.
I’m not going to go into any more detail than this, but it’s interesting to note that the tradeoff is exponentially asymmetric – if we’re willing to tolerate a small increase in lookup cost, we can potentially obtain a much larger reduction in update cost. To see this, set $\epsilon = 1/2$: queries now cost $O(2\log_B N)$ I/Os, only about twice a B-tree’s, while updates cost $O\!\left(\frac{2\log_B N}{\sqrt{B}}\right)$ I/Os; with $B = 2500$, that’s $\sqrt{B}/2 = 25$ times cheaper than a B-tree’s.
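To get a feel for the asymmetry, here’s a quick numeric exploration of a few points on the curve (same illustrative B and N as before; constant factors are ignored, so only the shape matters):

```python
import math

B, N = 2500, 10**9

def costs(eps):
    """Query/update I/O costs at the eps point on the tradeoff curve."""
    query = math.log(N, B) / eps                      # O(log_B N / eps)
    update = math.log(N, B) / (eps * B ** (1 - eps))  # O(log_B N / (eps * B^(1-eps)))
    return query, update

for eps in (1.0, 0.5, 0.25):
    q, u = costs(eps)
    print(f"eps={eps:<5} query ~{q:5.1f} I/Os   update ~{u:8.5f} I/Os")
```

Moving from $\epsilon = 1$ to $\epsilon = 1/2$ doubles the query cost but cuts the update cost 25-fold; in general, the update savings grow polynomially in B while the query penalty grows only by constant factors.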
In part 2, I’ll look at what happens when we want to add versioning to our dictionary – that’s where it starts to get really interesting!

