On 4/13/07, Tzahi Fadida <Tzahi.ML@xxxxxxxxx> wrote:
> On Friday 13 April 2007 17:52:16 Eddie Pettis wrote:
> > > I thought about maybe creating an expandable hashtable, for fun :).
> > > i.e. you define the number of items per hashkey, and if there is no room
> > > you rebuild the hashtable with a higher order. You can also add that you
> > > rebuild if the table is xx% filled (probably 67% is a golden ratio) to
> > > avoid too many rebuilds.
> >
> > I don't know if percentage filled is the proper metric for chained
> > hash tables such as these. Average search depth is probably a better
> > metric, possibly maintained through an exponential average with a low
> > decay factor. However, percentage filled is an excellent metric if
> > you implement an open addressing table.
>
> My main concern, and the reason for the percentage, is what happens if the
> keys of the entries in the buckets are not differentiated even in the next
> rebuild or two.
That's just a consequence of hashing in general. A well-designed hash
function generates a uniformly distributed, effectively random hash. You can
construct a malicious set of keys that creates pathologically bad behavior,
but that scenario is unlikely in practice. In short, I wouldn't worry about
this problem, because you will _on average_ improve search depth by resizing.
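To make concrete what I meant earlier about maintaining an exponential
average of search depth, here is a rough, untested sketch in plain C. This is
not kernel code: the names, the 1/8 decay factor, and the depth limit of 3
are all made up for illustration. It just shows one way a chained table could
track average lookup depth without floating point and rebuild itself at twice
the order when that average drifts too high.

/* Rough sketch only: a chained hash table that tracks an exponential
 * moving average of lookup depth and doubles its bucket count when the
 * average gets too deep.  Assumes an already-initialized table and
 * pre-hashed keys; names and constants are invented for illustration. */
#include <stdlib.h>

struct node {
    struct node *next;
    unsigned long key;          /* assumed already hashed */
};

struct ht {
    struct node **bins;
    unsigned long nbins;        /* always a power of two */
    unsigned long ema_depth_x8; /* average search depth, scaled by 8 */
};

#define DEPTH_LIMIT_X8 (3 * 8)  /* resize once the average passes ~3 */

static void ht_resize(struct ht *t, unsigned long new_nbins)
{
    struct node **nb = calloc(new_nbins, sizeof(*nb));
    unsigned long i;

    if (!nb)
        return;                 /* keep the old table if allocation fails */

    for (i = 0; i < t->nbins; i++) {
        struct node *n, *next;

        for (n = t->bins[i]; n; n = next) {
            next = n->next;
            n->next = nb[n->key & (new_nbins - 1)];
            nb[n->key & (new_nbins - 1)] = n;
        }
    }
    free(t->bins);
    t->bins = nb;
    t->nbins = new_nbins;
}

static struct node *ht_lookup(struct ht *t, unsigned long key)
{
    unsigned long depth = 0;
    struct node *n;

    for (n = t->bins[key & (t->nbins - 1)]; n; n = n->next) {
        depth++;
        if (n->key == key)
            break;
    }

    /* exponential average with a 1/8 weight on the newest sample, kept
     * scaled by 8 to avoid floating point:
     *   ema_x8 = ema_x8 - ema_x8/8 + depth   (i.e. ema = 7/8*ema + depth/8) */
    t->ema_depth_x8 += depth - t->ema_depth_x8 / 8;

    if (t->ema_depth_x8 > DEPTH_LIMIT_X8)
        ht_resize(t, t->nbins * 2);     /* rebuild at twice the order */

    return n;
}

The 1/8 weight is arbitrary; a smaller weight makes the average smoother and
the resizes less trigger-happy, at the cost of reacting more slowly.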
> That could prove very wasteful; thus, the percentage is meant as a barrier
> to prevent too many rebuilds, which could waste space and lower performance.
> You can still use other metrics on top of it, i.e., you can set it at 50%
> and use average search depth, etc.
I understand that you don't want to be rebuilding constantly. I ran a few
little experiments comparing the two metrics of percent fullness and average
search depth.

Suppose you wanted the 67% full metric at the hlist_head level. Then you will
need to resize after approximately the following number of inserts (computed
using a quick MATLAB simulation, averaging 64 runs for each number of bins):

bins =  128   inserts =   171   avg search = 1.66   (after resizing) = 1.33
bins =  256   inserts =   286   avg search = 1.56   (after resizing) = 1.28
bins =  512   inserts =   575   avg search = 1.56   (after resizing) = 1.28
bins = 1024   inserts =  1189   avg search = 1.58   (after resizing) = 1.29
bins = 2048   inserts =  2284   avg search = 1.56   (after resizing) = 1.28

Suppose instead you wanted an average search depth of 1.6. You get the
following number of inserts (computed from the average search depth formula
for a chained table, avg search ~= 1 + inserts / (2 * bins)):

bins =  128   inserts =   154   avg search = 1.60   (after resizing) = 1.30
bins =  256   inserts =   308   avg search = 1.60   (after resizing) = 1.30
bins =  512   inserts =   615   avg search = 1.60   (after resizing) = 1.30
bins = 1024   inserts =  1229   avg search = 1.60   (after resizing) = 1.30
bins = 2048   inserts =  2458   avg search = 1.60   (after resizing) = 1.30

This is the same thing, just a different way of looking at it. Hence, you can
approximate the behavior of table fullness using average search depth.

The furthest the fullness scheme can be pushed is 100% fullness, which
produces the following numbers of insertions:

bins =  128   inserts =   788   avg search = 4.07   (after resizing) = 2.54
bins =  256   inserts =  1275   avg search = 3.49   (after resizing) = 2.24
bins =  512   inserts =  4184   avg search = 5.08   (after resizing) = 3.04
bins = 1024   inserts =  8765   avg search = 5.28   (after resizing) = 3.14
bins = 2048   inserts = 26668   avg search = 7.51   (after resizing) = 4.26

The fullness scheme has a nonlinear effect on how long it takes to get data
out of the hash table, which makes it difficult to predict the table's
performance both before and after a resize. In contrast, an average search
depth target lets you define a constant performance level that is maintained
as the table grows. For example, if you want an average search depth of 3,
you can analytically compute how large the table should be and resize only
when that constraint is violated, as shown below:

bins =  128   inserts =   513   avg search = 3.00   (after resizing) = 2.00
bins =  256   inserts =  1025   avg search = 3.00   (after resizing) = 2.00
bins =  512   inserts =  2049   avg search = 3.00   (after resizing) = 2.00
bins = 1024   inserts =  4097   avg search = 3.00   (after resizing) = 2.00
bins = 2048   inserts =  8193   avg search = 3.00   (after resizing) = 2.00

Though it's a non-scientific argument, I would also suggest that average
search depth should be the resizing metric, because a programmer is more
likely to ask "How long does it take to get data out of the hash table?"
than "How full is the hash table?"

Of course, all of these computations are subject to the standard assumptions
of uniformly random input and hashes. Your results may vary if your
assumptions differ.

--
Eddie Pettis
Electrical and Computer Engineering
Purdue University
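P.S. In case anyone wants to play with the numbers, here is a tiny C version
of the back-of-the-envelope arithmetic above. It only encodes the same
chained-hashing estimate used above (average successful search depth of
roughly 1 + n / (2 * bins) under uniform hashing); the function names and
sample values are just made up for the example.

/* Back-of-the-envelope helpers for the tables above.  These only encode
 * the standard chained-hashing approximation (n entries spread uniformly
 * over b bins): average successful search depth ~= 1 + n / (2 * b).
 * Everything here is illustrative, not taken from any real code base. */
#include <stdio.h>

static double avg_search_depth(unsigned long entries, unsigned long bins)
{
    return 1.0 + (double)entries / (2.0 * (double)bins);
}

/* Smallest power-of-two bin count keeping the average depth at or
 * below `target'. */
static unsigned long bins_for_depth(unsigned long entries, double target)
{
    unsigned long bins = 1;

    while (avg_search_depth(entries, bins) > target)
        bins *= 2;
    return bins;
}

int main(void)
{
    printf("%.2f\n", avg_search_depth(513, 128));   /* 3.00, as in the last table */
    printf("%.2f\n", avg_search_depth(513, 256));   /* 2.00 after doubling the bins */
    printf("%lu\n", bins_for_depth(512, 3.0));      /* 128 bins hold 512 entries at depth <= 3 */
    return 0;
}

With the sample inputs it prints 3.00, 2.00, and 128, matching the last table
and the doubling step.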