## 2013年4月20日 星期六

### Cuckoo Hashing

Rasmus Pagh 1
IT University of Copenhagen, Rued Langgaardsvej 7, 2300 Kobenhavn S, Denmark
Flemming Friche Rodler
ON-AIR A/S, Digtervejen 9, 9200 Aalborg SV, Denmark.

* Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). This work was initiated while visiting Stanford University

* Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation.
* This work was done while at Aarhus University

==============================================================

### Abstract:

We present a simple dictionary with worst case constant lookup time, equaling the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al. (Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4):738–761, 1994). The space usage is similar to that of binary search trees. Besides being conceptually much simpler than previous dynamic dictionaries with worst case constant lookup time, our data structure is interesting in that it does not use perfect hashing, but rather a variant of open addressing where keys can be moved back in their probe sequences. An implementation inspired by our algorithm, but using weaker hash functions, is found to be quite practical. It is competitive with the best known dictionaries having an average case (but no nontrivial worst case) guarantee on lookup time.

Key words: data structures, dictionaries, information retrieval, searching, hashing, experiments

(*Remember O() is the limit for a large number of operations. On 'average' you wont have many collisions - it's not necessary that an individual operation has no collision.

*Perfect hashing guarantees
• O(1) lookup
• O(1) insert
*Cuckoo hashing also guarantees
• O(1) lookup**
• O(1) insert**
** O(1) expected time for insert
O(1) worst case time for query/delete: because x is either at T[h1(x)] or at T[h2(x)] ),  query/delete takes worst-case two probes.)

### 1. INTRODUCTION

The dictionary data structure is ubiquitous in computer science. A dictionary is used for maintaining a set S under insertion and deletion of elements (referred to as keys) from a universe U. Membership queries ("x  S") provide access to the data. In case of a positive answer the dictionary also provides a piece of satellite data that was associated with x when it was inserted.

A large theory, partially surveyed in Section 2, is devoted to dictionaries. It is common to study the case where keys are bit strings in U ={0,1}and w is the word length of the computer (for theoretical purposes modeled as a RAM). Section 3 discusses this restriction. It is usually, though not always, clear how to return associated information once membership has been determined. E.g., in all methods discussed in this paper, the associated information of x  S can be stored together with x in a hash table. Therefore we disregard the time and space used to handle associated information and concentrate on the problem of maintaining S. In the following we let n denote |S|.

The most e fficient dictionaries, in theory and in practice, are based on hashing techniques. The main performance parameters are of course lookup time, update time, and space. In theory there is no trade-o between these: One can simultaneously achieve constant lookup time, expected amortized constant update time, and space within a constant factor of the information theoretical minimum of B = bits [6]. In practice, however, the various constant factors are crucial for many applications. In particular, lookup time is a critical parameter. It is well known that one can achieve performance arbitrarily close to optimal if a su ciently sparse hash table is used. Therefore the challenge is to combine speed with a reasonable space usage. In particular, we only consider schemes using O(n) words of space.

The contribution of this paper is a new, simple hashing scheme called Cuckoo Hashing. A description and analysis of the scheme is given in Section 4, showing that it possesses the same theoretical properties as the dynamic dictionary of Dietzfelbinger et al. [10]. That is, it has worst case constant lookup time and amortized expected constant time for updates. A special feature of the lookup procedure is that (disregarding accesses to a small hash function description) there are just two memory accesses, which are independent and can be done in parallel if this is supported by the hardware. Our scheme works for space similar to that of binary search trees, i.e., three words per key in S on average.

Using weaker hash functions than those required for our analysis, Cuckoo Hashing is very simple to implement. Section 5 describes such an implementation, and reports on extensive experiments and comparisons with the most commonly used methods, having no nontrivial worst case guarantee on lookup time. It seems that an experiment comparing the most commonly used methods on a modern multi-level memory architecture has not previously been described in the literature. Our experiments show Cuckoo Hashing to be quite competitive, especially when the dictionary is small enough to t in cache. We thus believe it to be attractive in practice, when a worst case guarantee on lookups is desired.

### 2. PREVIOUS WORK ON LINEAR SPACE DICTIONARIES

Hashing, rst described in public literature by Dumey [12], emerged in the 1950s as a space e cient heuristic for fast retrieval of information in sparse tables. Knuth surveys the most important classical hashing methods in [18, Section 6.4]. The most prominent, and the basis for our experiments in Section 5, are Chained Hashing (with separate chaining), Linear Probing and Double Hashing. Judging from leading textbooks on algorithms, Knuth's selection of algorithms is in agreement with current practice for implementation of general purpose dictionaries. In particular, the excellent cache usage of Linear Probing makes it a prime choice on modern architectures. A more recent scheme called Two-Way Chaining [2] will also be investigated. All schemes are brie y described in Section 5.

#### 2.1. Analysis of early hashing schemes

Early theoretical analysis of hashing schemes was done under the assumption that hash function values are uniformly random and independent. Precise analyses of the average and expected worst case behaviors of the above mentioned schemes have been made, see for example [14, 18]. We mention just a few facts, disregarding asymptotically vanishing terms. Note that some figures depend on implementation details { the below hold for the implementations described in Section 5.

We first consider the expected number of memory probes needed by the two \open addressing" schemes to insert a key in a hash table where an α  fraction of the table, 0 < α < 1, is occupied by keys. For Linear Probing the expected number of probes during insertion is 1/2(1 + 1/(1-α)^2). This coincides with the expected number of probes for unsuccessful lookups, and with the number of probes needed for looking up the key if there are no subsequent deletions. A deletion rearranges keys to the con guration that would occur if the deleted key had never been inserted. In Double Hashing the expected cost of an insertion is 1/(1-α). As keys are never moved, this coincides with the number of probes needed for looking up the key and for deleting the key. If a key has not been inserted in the hash table since the last rehash, the expected cost of looking it up (unsuccessfully) is 1/(1-β), where is the fraction of keys and "deleted" markers in the hash table. If the key still has a "deleted" marker in the table, the expected cost of the unsuccessful lookup is one probe more.

For Chained Hashing with hash table size n/α , the expected length of the list traversed during an unsuccessful lookup is . This means that the expected number of probes needed to insert a new key is 1 + α, which will also be the number of probes needed to subsequently look up the key (note that probes to pointers are not counted). A deletion results in the data structure that would occur if the key had never been inserted.

In terms of number of probes, the above implies that, for any given , Chained Hashing is better than Double Hashing, which is again better than Linear Probing. It should be noted, however, that the space used by Chained Hashing is larger than that in the open addressing schemes for the same α . The difference depends on the relative sizes of keys and pointers.

The longest probe sequence in Linear Probing is of expected length Ω(log n). For Double Hashing the longest successful probe sequence is expected to be of length Ω(log n), and there is in general no sub-linear bound on the length of unsuccessful searches. The expected maximum chain length in Chained Hashing is Θ (log n/ log log n).

Though the above results seem to agree with practice, the randomness assumptions used for the analyses are questionable in applications. Carter and Wegman [7] succeeded in removing such assumptions from the analysis of Chained Hashing, introducing the concept of universal hash function families. When implemented with a random function from Carter and Wegman's universal family, chained hashing has constant expected time per dictionary operation (plus an amortized expected constant cost for resizing the table). For a certain efficient hash function family of Siegel [32], Linear Probing and Double Hashing provably satisfy the above performance bounds [30, 31]. Siegel's hash functions, summarized in Theorem 3.1, are also used in Cuckoo Hashing.

• f(n) ∈ Ο(g(n)) ≈ a ≤ b
• f(n) ∈ Ω(g(n)) ≈ a ≥ b
• f(n) ∈ Θ(g(n)) ≈ a = b
• f(n) ∈ ο(g(n)) ≈ a < b
• f(n) ∈ ω(g(n)) ≈ a > b

• 1.O(1)或O(c)：稱為常數時間(constant time)
2. O(n)：稱為線性時間(linear time)
3.O(log2n)：稱為次線性時間(sub-linear time)
5.O(n3)：稱為立方時間(cubic time)
6.O(2n)：稱為指數時間(exponential time)
7.O(n1og2n)：介於線性及二次方成長的中間之行為模式。

#### 2.2. Key rearrangement schemes

A number of (open addressing) hashing schemes have been proposed that share a key feature with the scheme described in this paper, namely that keys are moved around during insertions [3, 15, 19, 20, 28]. The main focus in these schemes is to reduce the average number of probes needed for fi nding a key in a (nearly) full table to a constant, rather than the O(log n) average exhibited by standard open addressing. This is done by occasionally moving keys forward in their probe sequences.

In our algorithm we rearrange keys in order to reduce the worst case number of probes to a constant. A necessary condition for this is reuse of hash function values, i.e., that keys are moved back in their probe sequence. Backward moves were not used in any previous rearrangement scheme,presumably due to the difficulty that moving keys back does not give a fresh, "random" placement. The thing that allows us to obtain worst case efficient lookups is that we do not deal with full hash tables, but rather hash tables that are less than half full. It was shown in [24] that in this case there exists, with high probability, an arrangement that allows any key to be found in two hash table probes. We show how to efficiently maintain such an arrangement under updates of the key set.

Arrangements of keys with optimal worst case retrieval cost were in fact already considered by Rivest in [28], where a polynomial time algorithm for finding such an arrangement was given. Also, it was shown that if one updates the key set, the expected number of keys that need to be moved to achieve a new optimal arrangement is constant. (The analysis requires that the hash table is sufficiently sparse, and assumes the hash function to be truly random.) This suggests a dictionary that solves a small assignment problem after each insertion and deletion. It follows from [24] and this paper, that Rivest's dictionary achieved worst case constant lookup time and expected amortized constant update time, 8 years before an algorithm with the same performance and randomness assumption was published by Aho and Lee [1]. Further, we show that Siegel's hash functions suffice for the analysis. Last but not least, the algorithm we use for rearranging keys is much simpler and more efficient than that suggested by Rivest.

Another key rearrangement scheme with similarities to Cuckoo Hashing is Last-come-first-served Hashing [27], which has low variance on search time as its key feature. It uses the same greedy strategy for moving keys as is used in this paper, but there is no reuse of hash function values.

#### 2.3. Hashing schemes with worst case lookup guarantee

Two-Way Chaining is an alternative to Chained Hashing that offers O(log log n) maximal lookup time with high probability (assuming truly random hash functions). The implementation that we consider represents the lists by fixed size arrays of size O(log log n) (if a longer chain is needed, a rehash is performed). To achieve linear space usage, one must then use a hash table of size O(n/ log log n), implying that the average chain length is  Ω(log log n).

Another scheme with this worst case guarantee is Multilevel Adaptive Hashing [5]. However, lookups can be performed in O(1) worst case time if O(log log n) hash function evaluations, memory probes and comparisons are possible in parallel. This is similar to the scheme described in this paper, though we use only two hash function evaluations, memory probes and comparisons.

A dictionary with worst case constant lookup time was first obtained by Fredman, Koml os and Szemer edi [13], though it was static, i.e., did not support updates. It was later augmented with insertions and deletions

in amortized expected constant time by Dietzfelbinger et al. [10]. Dietzfelbinger and Meyer auf der Heide [11] improved the update performance by exhibiting a dictionary in which operations are done in constant time with high probability, namely at least 1 - n-c , where c is any constant of our choice. A simpler dictionary with the same properties was later developed [8]. When n = |U|1-O(1) a space usage of O(n) words is not within a constant factor of the information theoretical minimum. The dictionary of Brodnik and Munro [6] offers the same performance as [10], B + O(B) bits in all cases. However, it does not support information associated with keys.

### 3. PRELIMINARIES

We assume that keys from U fit exactly in a single machine word, that is, U ={0,1}. A special value   U is reserved to signal an empty cell in hash tables. For Double Hashing an additional special value is used to indicate a deleted key.

Our algorithm uses hash functions from a universal family.

If U is not too large compared to k, there exists a space-efficient (2; k)-universal family due to Siegel [32] that has constant evaluation time (however, the constant is not a small one).

Theorem 3.1 (Siegel). There is a constant c such that for, k = 2^Ω(w), there exists a (2, k)-universal family that uses space and initialization time O(k^c), and which can be evaluated in constant time.

Our restriction that keys are single words is not a serious one. Longer keys can be mapped to keys of O(1) words by applying a random function from a (O(1), 2)-universal family. There is such a family whose functions can be evaluated in time linear in the number of input words [7]. It works by evaluating a function from a (O(1), 2)-universal family on each word, computing the bitwise exclusive or of the function values. (See [34] for an efficient implementation).  function with range {0,1}2log(n)+c will, with probability 1-O(2^c), be injective on S. In fact, with constant probability the function is injective on a given sequence of  Ω(2c/2n) consecutive sets in a dictionary of initial size n (see [10]). When a collision between two elements of S occurs, everything is rehashed. If a rehash can be done in expected O(n) time, the amortized expected cost of this is O(2-c/2) per insertion. In this way we can effectively reduce the universe size to O(n2), though the full keys still need to be stored to decide membership.When c = O(log n) the reduced keys are of length O(log n). For any  > 0,Theorem 3.1 then provides a family of constant time evaluable (2, n^Ω(1))-universal hash functions, whose functions can be stored using space O(n^c).

### 4. CUCKOO HASHING

Cuckoo Hashing is a dynamization of a static dictionary described in [24]. The dictionary uses two hash tables, T1 and T2, each of length r, and two hash functions h1,h2 : U → {0,...,r-1}. Every key x  S is stored in cell h1(x) of T1 or h2(x) of T2, but never in both. Our lookup function is

function lookup(x)
return T1[h1(x)] = x _ T2[h2(x)] = x
end

Two table accesses for lookup is in fact optimal among all dictionaries using linear space, except for special cases, see [24]. Remark: The idea of storing keys in one out of two places given by hash functions previously appeared in [16] in the context of PRAM simulation, and in [2] for Two-Way Chaining.

Remark: The idea of storing keys in one out of two places given by hash functions previously appeared in [16] in the context of PRAM simulation, and in [2] for Two-Way Chaining.

It is shown in [24] that if r >= (1 + ) n for some constant  > 0 (i.e., the tables are to be a bit less than half full), and h1, h2 are picked uniformly at random from an (O(1), O(log n))-universal family, the probability that there is no way of arranging the keys of S according to h1 and h2 is O(1/n). By the discussion in Section 3 we may assume without loss of generality that there is such a family, with constant evaluation time and negligible space usage. A suitable arrangement of the keys was shown in [24] to be computable in expected linear time, by a reduction to 2-sat.

We now consider a simple dynamization of the above. Deletion is of course simple to perform in constant time, not counting the possible cost of shrinking the tables if they are becoming too sparse. As for insertion, it turns out that the "cuckoo approach", kicking other keys away until every key has its own "nest", works very well. Specially, if x is to be inserted we fi rst see if cell h1(x) of T1 is occupied. If not, we are done. Otherwise we set T1[h1(x)] <--  x anyway, thus making the previous occupant "nestless".

This key is then inserted in T2 in the same way, and so forth iteratively, see Figure 1(a). It may happen that this process loops, as shown in Figure 1(b). Therefore the number of iterations is bounded by a value "MaxLoop" to be specified in Section 4.1. If this number of iterations is reached, everything is rehashed with new hash functions, and we try once again to accommodate the nestless key.

Using the notation x ↔  y to express that the values of variables x and y are swapped, the following code summarizes the insertion procedure.

procedure insert(x)
if lookup(x) then return
loop MaxLoop times
if T1[h1(x)] = ? then f T1[h1(x)]   x; return g
x $T1[h1(x)] if T2[h2(x)] = ? then f T2[h2(x)] x; return g x$ T2[h2(x)]
end loop
rehash(); insert(x)
end

The above procedure assumes that each table remains larger than (1 +  ) n cells. When no such bound is known, a test must be done to fi nd out when a rehash to larger tables is needed. Resizing of tables can be done in amortized expected constant time per update by the usual doubling/halving technique (see, e.g., [10]). The hash functions used will be (O(1), MaxLoop)-universal, which means that they will act almost like truly random functions on any set of keys processed during the insertion loop.

The lookup call preceding the insertion loop ensures robustness if the key to be inserted is already in the dictionary. A slightly faster implementation can be obtained if this is known not to occur.

Note that the insertion procedure is biased towards inserting keys in T1. As will be seen in Section 5 this leads to faster successful lookups, due to more keys being found in T1. This e ect is even more pronounced if one uses an asymmetric scheme where T1 is larger than T2. In both cases, the insertion time is only slightly worse than that of a completely symmetric implementation. Another variant is to use a single table for both hash functions, but this requires keeping track of the hash function according to which each key is placed. In the following we consider just the symmetric two table scheme.

#### 4.1. Analysis

Our analysis of the insertion procedure has three main parts:
1. We first exhibit some useful characteristics of the behavior of the insertion procedure.
2. We then derive a bound on the probability that the insertion procedure uses at least t iterations.
3. Finally we argue that the procedure uses expected amortized constant time.

Behavior of the Insertion Procedure

The simplest behavior of the insertion procedure occurs when it does not visit any hash table cell more than once. In this case it simply runs through a sequence x1,x2,......, of nestless keys with no repetitions, moving each key from one table to the other.

If, at some point, the insertion procedure returns to a previously visited cell, the behavior is more complicated, as shown in Figure 2. The key xi in the first previously visited cell will become nestless for the second time (occurring at positions i and j > i in the sequence) and be put back in its original cell. Subsequently all keys xi-1,...,x1 will be moved back where they were at the start of the insertion (assuming that the maximum number of iterations is not reached). In particular, x1 will end up nestless again, and the procedure will try placing it in the second table. At some point after this there appears a nestless key xl that is either moved to a vacant cell or a previously visited cell (again assuming that the maximum number of iterations is not reached). In the former case the procedure terminates. In the latter case a rehash must be performed, since we have a "closed loop" of  L - i + 1 keys hashing to onlyL - i cells. This means that the loop will run for the maximum number of iterations, followed by a rehash.

Lemma 4.1. Suppose that the insertion procedure does not enter a closed loop. Then for any prefix x1,x2,..., xp of the sequence of nestless keys, there must be a subsequence of at least p/3 consecutive keys without repetitions, starting with an occurrence of the key x1, i.e., the key being inserted.

Probability Bounds

We now consider the probability that the insertion loop runs for at least t iterations. For t > MaxLoop the probability is of course 0. Otherwise, by the above analysis, iteration number t is performed in two (not mutually exclusive) situations:

1. The hash function family used is not (1, MaxLoop)-universal when restricted to the set of keys in the dictionary (including the key being inserted).

2. The insertion procedure has entered a \closed loop", i.e., xl in Figure 2 was moved to a previously visited cell, for L <= 2t.
3. The insertion procedure has processed a sequence of at least (2t-1)/3 consecutive nestless keys starting with the newly inserted key.

We chose the hash function family such that the first situation occurs with probability O(1/(n^2)). Under the condition that the first situation does not occur, we now bound the probability of the two last situations.

The above derivation follows a suggestion of Sanders and V 'o'cking [32], and improves the O(1/n) bound in the conference version of this paper [27].

Concluding the Analysis
From the previous section it follows that the expected number of iterations in the insertion loop is bounded by:

Finally, we consider the cost of rehashing. First we consider only forced rehashes, caused by failed insertions. These occur if the insertion loop runs for t = MaxLoop iterations. By the previous section, the probability that this happens because of entering a closed loop, or because the hash function family fails to be (1,MaxLoop)-universal, is O(1/(n^2)). Setting MaxLoop = [3log1+e r], the probability of rehashing without entering a closed loop is, by (2), at most
2 (1 + ∈) ^−(2MaxLoop−1)/3+1 = O(1/(n^2)) .

Altogether, the probability that any given insertion causes a rehash is O(1/(n^2)). In particular, the n insertions performed during a rehash all succeed (i.e., cause no further rehash) with probability 1 − O(1/n). The expected time used per insertion is O(1), so the total expected time for trying to insert all keys is O(n). If an insertion fails during the rehash, a recursive rehash is started. Since we keep all keys in the tables all the time, this simply corresponds to starting over with another attempt at rehashing all keys. As the probability of having to start over with new hash functions is bounded away from 1, the total expected time for a rehash sums to O(n). Thus, for any insertion the expected time used for forced rehashing is O(1/n).

There will also be a rehash if r2 insertions have been performed with no failed insertions. Since the expected cost of the rehash is O(n), the amortized expected cost per insertion of such rehashes is O(1/n). Summing up, we have shown that the amortized expected time for insertion is bounded by a constant. The small probability of rehashing, together with (2), in fact implies that also the variance of the insertion time is constant.

### 4 Experiments

To examine the practicality of Cuckoo Hashing we experimentally compare it to three well known hashing methods, as described in [20, Section 6.4]: Chained Hashing (with separate chaining), Linear Probing and Double Hashing. We also consider Two-Way Chaining [2].

The first three methods all attempt to store a key x at position h(x) in a hash table. They differ in the way collisions are resolved, i.e., in what happens when two or more keys hash to the same location.

Chained Hashing. A linked list is used to store all keys hashing to a given location.

Linear Probing. A key is stored in the next empty table entry. Lookup of key x is done by scanning the table beginning at h(x) and ending when either x or an empty table entry is found. When deleting, some keys may have to be moved back in order to fill the hole in the lookup sequence, see [20, Algorithm R] for details.

Double Hashing. Insertion and lookup are similar to Linear Probing, but instead of searching for the next position one step at a time, a second hash function value is used to determine the step size. Deletions are handled by putting a special “deleted” marker in the cell of the deleted key. Lookups skip over deleted cells, while insertions overwrite them.

The fourth method, Two-Way Chaining, can be described as two instances of Chained Hashing. A key is inserted in one of the two hash tables, namely the one where it hashes to the shorter chain. A cache-friendly implementation, as recently suggested in [6], is to simply make each linked list a short, fixed size array. If a longer list is needed, a rehash must be performed.

#### 4.1 Previous Experimental Results

Although the dictionaries with worst case constant lookup time surveyed in Section 3 leave little to improve from a theoretical point of view, large constant factors and complicated implementation hinder their direct practical use. For example, in the “dynamic perfect hashing” scheme of [10] the upper bound on space is 35n words. The authors of [10] refer to a more practical variant due to Wenzel that uses space comparable to that of binary search trees. According to [19] the implementation of this variant in the LEDA library [25], described in [39], has average insertion time larger than that of AVL trees for n · 2^17, and more than four times slower than insertions in chained hashing. (On a Linux PC with an Intel@ Pentium@ 120 MHz processor.) The experimental results listed in [25, Table 5.2] show a gap of more than a factor of
6 between the update performance of chained hashing and dynamic perfect hashing, and a factor of more than 2 for lookups. (On a 300 MHz SUN ULTRA SPARC.)

Silverstein [36] reports that the space upper bound of the dynamic perfect hashing scheme of [10] is quite pessimistic compared to what can be observed when run on a subset of the DIMACS dictionary tests [24]. He goes on to explore ways of improving space as well as time, improving both the observed time and space by a factor of roughly three. Still, the improved scheme needs 2 to 3 times more space than an implementation of linear probing to achieve similar time per operation. Silverstein also considers versions of the data structures with packed representations of the hash tables. In this setting the dynamic perfect hashing scheme was more than 50% slower than linear probing, using roughly the same amount of space.

Is seems that recent experimental work on “classical” dictionaries (that do not have worst case constant lookup time) is quite limited. In [19] it is reported that chained hashing is superior to an implementation of dynamic perfect hashing in terms of both memory usage and speed.

#### 4.2 Data Structure Design and Implementation

We consider positive 32 bit signed integer keys and use 0 as . The data structures are robust in that they correctly handle attempts to insert an element already in the set, and attempts to delete an element not in the set. During rehashes this is known not to occur and slightly faster versions of the insertion procedure are used.

Apart from Chained Hashing, the schemes considered have in common the fact that they have only been analyzed under randomness assumptions that are currently impractical to realize. However, experience shows that rather simple and efficient hash function families yield performance close to that predicted under stronger randomness assumptions. We use a function family from [9] with range {0, 1}q for positive integer q. For every odd a, 0 < a < 2w, the family contains the function  ha(x) = (ax mod 2w) div 2w-q. Note that evaluation can be done very efficiently by a 32 bit multiplication and a shift. However, this choice of hash function restricts us to consider hash tables whose sizes are powers of two. A random function from the family (chosen using C’s rand function) appears to work fine with all schemes except Cuckoo Hashing. For Cuckoo Hashing we experimented with various hash functions and found that Cuckoo Hashing was rather sensitive to the choice of hash function. It turned out that the exclusive or of three independently chosen functions from the family of [9] was fast and worked well. We have no good explanation for this phenomenon. For all schemes, various alternative hash families were tried, with a decrease in performance.

All methods have been implemented in C. We have striven to obtain the fastest possible implementation of each scheme. Specific choices made and details differing from the references are:

Chained Hashing. C’s malloc and free functions were found to be a performance bottleneck, so a simple “freelist” memory allocation scheme is used. Half of the allocated memory is used for the hash table, and half for list elements. If the data structure runs out of free list elements, its size is doubled. We store the first key of each linked list directly in the hash table, as this often saves one cache miss. Having the first key in the hash table also slightly improves memory utilization, in the expected sense. This is because every non-empty linked list is one element shorter and because we expect more than half of the hash table cells to contain a linked list for the load factors considered here.

Double Hashing. To prevent the tables from clogging up with deleted cells, resulting in poor performance for unsuccessful lookups, all keys are rehashed when 2/3 of the hash table is occupied by keys and “deleted” markers. The fraction 2/3 was found to give a good trade-off between the time for insertion and unsuccessful lookups.

Linear Probing. Our first implementation, like that in [36], employed deletion markers. However, we found that using the deletion method described in [20, Algorithm R] was considerably faster, as far fewer rehashes were needed.

Two-Way Chaining. We allow four keys in each bucket. This is enough to keep the probability of a forced rehash low for hundreds of thousands of keys, by the results in [6]. For larger collections of keys one should allow more keys in each bucket, resulting in general performance degradation.

Cuckoo Hashing. The architecture on which we experimented could not parallelize the two memory accesses in lookups. Therefore we only evaluate the second hash function after the first memory lookup has shown unsuccessful.

For all schemes, rehashing was implemented as repeated insertion of all keys into a newly allocated hash table. For efficiency we used special insertion procedures without a check of whether keys were already inserted.

Some experiments were done with variants of Cuckoo Hashing. In particular, we considered Asymmetric Cuckoo, in which the first table is twice the size of the second one. This results in more keys residing in the first table, thus giving a slightly better average performance for successful lookups. For example, after a long sequence of alternate insertions and deletions at load factor 1/3, we found that about 76% of the elements resided in the first table of Asymmetric Cuckoo, as opposed to 63% for Cuckoo Hashing. There was no significant slowdown for other operations. We will describe the results for Asymmetric Cuckoo when they differ significantly from those of Cuckoo Hashing.

#### 4.3 Setup

Our experiments were performed on a PC running Linux (kernel version 2.2) with an 800 MHz Intel@ Pentium@ III processor, and 256 MB of memory (PC100 RAM). The processor has a 16 KB level 1 data cache and a 256 KB level 2 “advanced transfer” cache. Our results nicely fit a simple model parameterized by the cost of a cache miss and the expected number of probes to “random” locations (see the technical report [28] for details). They are thus believed to have significance for other hardware configurations. An advantage of using the Pentium@ processor for timing experiments is its rdtsc instruction which can be used to measure time in clock cycles. This gives access to very precise data on the behavior of algorithms, and allows us to discard the time used by the program issuing the calls to the Cuckoo Hashing data structure. In our case it also supplies a way of discarding measurements significantly disturbed by interrupts from hardware devices or the process scheduler, as these show up as a small group of timings significantly separated from all other timings. Programs were compiled using the gcc compiler version 2.95.2, using optimization flags -O9 -DCPU=586 -march=i586 -fomit-frame-pointer -finline-functions-fforce-mem -funroll-loops -fno-rtti. As mentioned earlier, we use a global clock cycle counter to time operations. If the number of clock cycles spent on a dictionary operation exceeds 5000, and there was no rehash, we conclude that the call was interrupted, and disregard the result (it was empirically observed that no operation ever took between 2000 and 5000 clock cycles). If a rehash is made, we have no way of filtering away time spent in interrupts. However, all tests were made on a machine with no irrelevant user processes, so disturbances should be minimal. On our machine it took 32 clock cycles to call the rdtsc instruction. These clock cycles have been subtracted from the results.

#### 4.4 Results

Our main experiment was designed to model the situation in which the size of the dictionary is not changing too much. It considers a sequence of mixed operations generated at random.We constructed the test operation sequences from a collection of high quality random bits publicly available on the Internet [23]. The sequences start by insertion of n distinct random keys, followed by 3n times four operations: A random unsuccessful lookup, a random successful lookup, a random deletion, and a random insertion. We timed the operations in the “equilibrium”, where the number of elements is stable. For load factor 1/3 our results appear in Figures 3 and 4, which show an average over 10 runs. We ran experiments with up to (2^24)/3 keys. As Linear Probing was consistently faster than Double Hashing, we chose it as the sole open addressing scheme in the plots. Time for forced rehashes was added to the insertion time. The results had a large variance, over the 10 runs, for sets of size 212 to 216. Outside this range the extreme values deviated from the average by less than about 7%. The large variance sets in when the data structure starts to fill the level 2 cache. We believe this is caused by our test program reading data from disk and thus sometimes evicting parts of the data structure from cache.

As can be seen, the time for lookups is almost identical for all schemes as long as the entire data structure fits in level 2 cache, i.e., for n < 216/3. After this the average number of accesses to a random memory cell (with the probability of a cache miss approaching 1) shows up. The shape of the curves reflect the increasing probability of a cache miss for an access to a random memory cell (see Section 5 of the technical report [28] for details). This makes linear probing an average case winner, with Cuckoo Hashing and Two-Way Chaining following about 40 clock cycles behind. For insertion the number of accesses to a random memory cell again dominates the picture for large sets, while the higher number of in-cache accesses and more computation makes Cuckoo Hashing, and in particular Two-Way chaining, slower for small sets.

The cost of forced rehashes sets in for Two-Way Chaining for sets of more than a million elements, at which point better results may have been obtained by a larger bucket size. For deletion Chained Hashing lags behind for large sets due to accesses to a random memory cell when freeing list elements, while the simplicity of Cuckoo Hashing makes it the fastest scheme. We note that, for dictionaries that fit in cache, the total time for an insertion and a deletion is smallest for Cuckoo Hashing among the four schemes.

At this point we should mention that the good cache utilization of Linear Probing and Two-Way Chaining depends on the cache lines being considerably larger than keys (and any associated information placed together with keys). If this is not the case, it causes the number of cache misses to rise significantly. The other schemes discussed here do not deteriorate in this way.

We made additional experiments concerning the cost of insertions in growing dictionaries and deletions in shrinking dictionaries, which takes into account the cost of rehashes needed to keep space utilization around 1/3. The interested reader can find the results of these tests in the technical report [28].

#### DIMACS Tests

Access to data in a dictionary is rarely random in practice. In particular, the cache is more helpful than in the above random tests, for example due to repeated lookups of the same key, and deletion of short-lived keys. As a rule of thumb, the time for such operations will be similar to the time when all of the data structure is in cache. To perform actual tests of the dictionaries on more realistic data, we chose a representative subset of the dictionary tests of the 5th DIMACS implementation challenge [24]. The tests involving string keys were pre-processed by hashing strings to 32 bit integers, as described in Appendix A.

This preserves, with high probability, the access pattern to keys. For each test we recorded the average time per operation, not including the time used for pre-processing. The minimum and maximum of six runs can be found in Tables 5 and 6, which also lists the average load factor. Linear probing is again the fastest, but mostly just 20-30% faster than the Cuckoo schemes.

#### The Number of Cache Misses During Insertion

We have seen that the number of accesses to a random memory cell (i.e., cache misses) is critical to the performance of hashing schemes. Whereas there is a very precise understanding of the probe behavior of the classic schemes (under suitable randomness assumptions), the analysis of the expected time for insertions in Section 2.3 is rather crude, establishing just a constant upper bound. One reason that our calculation does not give a very tight bound is that we use a pessimistic estimate on the number of key moves needed to accommodate a new element in the dictionary. Often a free cell will be found even though it could have been occupied by another key in the dictionary. We also pessimistically assume that a large fraction of key moves will be spent backtracking from an unsuccessful attempt to place the new key in the first table.

Figure 7 shows experimentally determined values for the average number of probes during insertion for various schemes and load factors below 1/2. We disregard reads and writes to locations known to be in cache, and the cost of rehashes. Measurements were made in “equilibrium” after 10^5 insertions and deletions, using tables of size 2^15 and truly random hash function values. We believe that this curve is independent of the table size (up to vanishing terms). The curve for Linear Probing does not appear, as the number of non-cached memory accesses depends on cache architecture (length of the cache line), but it is typically very close to 1. The curve for Cuckoo Hashing seems to be 2 + 1/(4 + 8α) ¼ 2 + 1/(4²). This is in good correspondence with (3) of the analysis in Section 2.3. It should be remarked that the highest possible load factor for Two-Way Chaining is O(1/ log log n).

As noted in Section 2, the insertion algorithm of Cuckoo Hashing is biased towards inserting keys in T1. If we instead of starting the insertion in T1 choose the start table at random, the number of cache misses decreases slightly for insertion. This is because the number of free cells in T1 increases as the load balance becomes even. However, this also means a slight increase in lookup time. Also note that since insertion checks if the element is already inserted, Cuckoo Hashing uses at least two cache misses. The initial lookup can be exploited to get a small improvement in insertion performance, by inserting right away when either cell T1[h1(x)] or T2[h2(x)] is vacant. For load factor 1/3 this places about 10% of newly inserted keys in T2. The relatively low percentage is the reason why we found no advantage in performing the extra check in our implementation.

Since lookup is very similar to insertion in Chained Hashing, one could think that the number of cache misses would be equal for the two operations. However, in our implementation, obtaining a free cell from the freelist may result in an extra cache miss. This is the reason why the curve for Chained Hashing in the figure differs from a similar plot in Knuth [20, Figure 44].

### 5 Conclusion

We have presented a new dictionary with worst case constant lookup time. It is very simple to implement, and has average case performance comparable to the best previous dictionaries. Earlier schemes with worst case constant lookup time were more complicated to implement and had worse average case performance. Several challenges remain. First of all an explicit, truly practical hash function family that is provably good for the scheme has yet to be found. One step in this direction was recently taken by Dietzfelbinger and Woelfel [12], but their hash functions still require a relatively large amount of space. Secondly, we lack a precise understanding of why the scheme exhibits low constant factors. In particular, the curve of Figure 7 needs to be explained.

Acknowledgements. The authors would like to thank Andrei Broder, Martin Dietzfelbinger, Rolf Fagerberg, Peter Sanders, John Tromp, and Berthold V¨ocking for useful comments and discussions on this paper and Cuckoo Hashing in general.