On 4/13/07, Tzahi Fadida <Tzahi.ML@xxxxxxxxx> wrote:
> On Friday 13 April 2007 17:52:16 Eddie Pettis wrote:
> > > I thought about maybe creating an expandable hashtable, for fun :).
> > > i.e. you define the number of items per hashkey, and if there is no room
> > > you rebuild the hashtable with a higher order. You can also add that you
> > > rebuild if the table is xx% filled (probably 67% is a golden ratio) to
> > > avoid too many rebuilds.
> >
> > I don't know if percentage filled is the proper metric for chained
> > hash tables such as these. Average search depth is probably a better
> > metric, possibly maintained through an exponential average with a low
> > decay factor. However, percentage filled is an excellent metric if
> > you implement an open addressing table.
>
> My main concern, and the reason for the percentage, is what happens if the
> keys of the entries in the buckets are not differentiated even in the next
> rebuild or two.
That's just a consequence of hashing in general. A well-designed hash
function generates a uniformly distributed, effectively random hash. You can
construct a malicious set of keys that creates pathologically bad behavior,
but that scenario is unlikely in practice. In short, I wouldn't worry about
this problem, because you will _on average_ improve search depth by resizing.
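To make concrete what I meant earlier about maintaining an exponential
average of search depth, here is a rough, untested sketch in plain C. This is
not kernel code: the names, the 1/8 decay factor, and the depth limit of 3
are all made up for illustration. It just shows one way a chained table could
track average lookup depth without floating point and rebuild itself at twice
the order when that average drifts too high.

/* Rough sketch only: a chained hash table that tracks an exponential
 * moving average of lookup depth and doubles its bucket count when the
 * average gets too deep.  Assumes an already-initialized table and
 * pre-hashed keys; names and constants are invented for illustration. */
#include <stdlib.h>

struct node {
    struct node *next;
    unsigned long key;          /* assumed already hashed */
};

struct ht {
    struct node **bins;
    unsigned long nbins;        /* always a power of two */
    unsigned long ema_depth_x8; /* average search depth, scaled by 8 */
};

#define DEPTH_LIMIT_X8 (3 * 8)  /* resize once the average passes ~3 */

static void ht_resize(struct ht *t, unsigned long new_nbins)
{
    struct node **nb = calloc(new_nbins, sizeof(*nb));
    unsigned long i;

    if (!nb)
        return;                 /* keep the old table if allocation fails */

    for (i = 0; i < t->nbins; i++) {
        struct node *n, *next;

        for (n = t->bins[i]; n; n = next) {
            next = n->next;
            n->next = nb[n->key & (new_nbins - 1)];
            nb[n->key & (new_nbins - 1)] = n;
        }
    }
    free(t->bins);
    t->bins = nb;
    t->nbins = new_nbins;
}

static struct node *ht_lookup(struct ht *t, unsigned long key)
{
    unsigned long depth = 0;
    struct node *n;

    for (n = t->bins[key & (t->nbins - 1)]; n; n = n->next) {
        depth++;
        if (n->key == key)
            break;
    }

    /* exponential average with a 1/8 weight on the newest sample, kept
     * scaled by 8 to avoid floating point:
     *   ema_x8 = ema_x8 - ema_x8/8 + depth   (i.e. ema = 7/8*ema + depth/8) */
    t->ema_depth_x8 += depth - t->ema_depth_x8 / 8;

    if (t->ema_depth_x8 > DEPTH_LIMIT_X8)
        ht_resize(t, t->nbins * 2);     /* rebuild at twice the order */

    return n;
}

The 1/8 weight is arbitrary; a smaller weight makes the average smoother and
the resizes less trigger-happy, at the cost of reacting more slowly.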
> That could prove very wasteful; thus, the percentage is meant as a barrier
> to prevent too many rebuilds, which could waste space and lower performance.
> You can still use other metrics on top of it, i.e., you can set it at 50%
> and use average search depth, etc.
I understand that you don't want to be rebuilding constantly. I ran a few
little experiments comparing the two metrics of percent fullness and average
search depth.

Suppose you wanted the 67% full metric at the hlist_head level. Then you will
need to resize after approximately the following number of inserts (computed
using a quick MATLAB simulation, averaging 64 runs for each number of bins):

bins =  128   inserts =   171   avg search = 1.66   (after resizing) = 1.33
bins =  256   inserts =   286   avg search = 1.56   (after resizing) = 1.28
bins =  512   inserts =   575   avg search = 1.56   (after resizing) = 1.28
bins = 1024   inserts =  1189   avg search = 1.58   (after resizing) = 1.29
bins = 2048   inserts =  2284   avg search = 1.56   (after resizing) = 1.28

Suppose instead you wanted an average search depth of 1.6. You get the
following number of inserts (computed from the average search depth formula
for a chained table, avg search ~= 1 + inserts / (2 * bins)):

bins =  128   inserts =   154   avg search = 1.60   (after resizing) = 1.30
bins =  256   inserts =   308   avg search = 1.60   (after resizing) = 1.30
bins =  512   inserts =   615   avg search = 1.60   (after resizing) = 1.30
bins = 1024   inserts =  1229   avg search = 1.60   (after resizing) = 1.30
bins = 2048   inserts =  2458   avg search = 1.60   (after resizing) = 1.30

This is the same thing, just a different way of looking at it. Hence, you can
approximate the behavior of table fullness using average search depth.

The furthest the fullness scheme can be pushed is 100% fullness, which
produces the following numbers of insertions:

bins =  128   inserts =   788   avg search = 4.07   (after resizing) = 2.54
bins =  256   inserts =  1275   avg search = 3.49   (after resizing) = 2.24
bins =  512   inserts =  4184   avg search = 5.08   (after resizing) = 3.04
bins = 1024   inserts =  8765   avg search = 5.28   (after resizing) = 3.14
bins = 2048   inserts = 26668   avg search = 7.51   (after resizing) = 4.26

The fullness scheme has a nonlinear effect on how long it takes to get data
out of the hash table, which makes it difficult to predict the table's
performance both before and after a resize. In contrast, an average search
depth target lets you define a constant performance level that is maintained
as the table grows. For example, if you want an average search depth of 3,
you can analytically compute how large the table should be and resize only
when that constraint is violated, as shown below:

bins =  128   inserts =   513   avg search = 3.00   (after resizing) = 2.00
bins =  256   inserts =  1025   avg search = 3.00   (after resizing) = 2.00
bins =  512   inserts =  2049   avg search = 3.00   (after resizing) = 2.00
bins = 1024   inserts =  4097   avg search = 3.00   (after resizing) = 2.00
bins = 2048   inserts =  8193   avg search = 3.00   (after resizing) = 2.00

Though it's a non-scientific argument, I would also suggest that average
search depth should be the resizing metric, because a programmer is more
likely to ask "How long does it take to get data out of the hash table?"
than "How full is the hash table?"

Of course, all of these computations are subject to the standard assumptions
of uniformly random input and hashes. Your results may vary if your
assumptions differ.

--
Eddie Pettis
Electrical and Computer Engineering
Purdue University
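P.S. In case anyone wants to play with the numbers, here is a tiny C version
of the back-of-the-envelope arithmetic above. It only encodes the same
chained-hashing estimate used above (average successful search depth of
roughly 1 + n / (2 * bins) under uniform hashing); the function names and
sample values are just made up for the example.

/* Back-of-the-envelope helpers for the tables above.  These only encode
 * the standard chained-hashing approximation (n entries spread uniformly
 * over b bins): average successful search depth ~= 1 + n / (2 * b).
 * Everything here is illustrative, not taken from any real code base. */
#include <stdio.h>

static double avg_search_depth(unsigned long entries, unsigned long bins)
{
    return 1.0 + (double)entries / (2.0 * (double)bins);
}

/* Smallest power-of-two bin count keeping the average depth at or
 * below `target'. */
static unsigned long bins_for_depth(unsigned long entries, double target)
{
    unsigned long bins = 1;

    while (avg_search_depth(entries, bins) > target)
        bins *= 2;
    return bins;
}

int main(void)
{
    printf("%.2f\n", avg_search_depth(513, 128));   /* 3.00, as in the last table */
    printf("%.2f\n", avg_search_depth(513, 256));   /* 2.00 after doubling the bins */
    printf("%lu\n", bins_for_depth(512, 3.0));      /* 128 bins hold 512 entries at depth <= 3 */
    return 0;
}

With the sample inputs it prints 3.00, 2.00, and 128, matching the last table
and the doubling step.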