Re: Hashtable in the kernel

On Monday 16 April 2007 03:20:42 Eddie Pettis wrote:
> > My main concern and reason for percentage is that what happens if the
> > keys of the entries in the buckets are not differentiated even in the
> > next rebuild or 2 rebuilds.
>
> That's just a consequence of hashing in general.  Well-designed
> hashing algorithms are designed to generate a uniform random hash.
> You can construct a malicious set of keys that create pathologically
> bad behavior, but that scenario is unlikely in practice.
>
> In short, I wouldn't worry about this problem because you will
> _on_average_ improve search depth by resizing.

That depends on your goals; see below.

>
> > That could prove very wasteful, thus, the percentage is meant as
> > a barrier to prevent too many rebuilds, which could waste space and lower
> > performance. You can still use other metrics on top of it. I.e., you can
> > set it at 50% and use average search depth, etc...
>
> I understand that you don't want to be rebuilding constantly.  I ran a
> few little experiments comparing the two metrics of percent fullness
> and average search depth.
>
> Suppose you wanted the 67% full metric at the hlist_head level.  Then,
> you will need to resize after approximately the following number of
> inserts (computed using a quick MATLAB simulation with the average of
> 64 runs for each number of bins):

I was referring to the overall fullness of the data structure, not to each key 
level. The goal was to weigh the space saved in memory against the average key 
search cost as you define it. However, in the real world, most of the data 
could behave fine while still producing occasional spikes, which is exactly 
what we want to ignore.

> bins = 128	inserts = 171	avg search = 1.66	(after resizing) = 1.33
> bins = 256	inserts = 286	avg search = 1.56	(after resizing) = 1.28
> bins = 512	inserts = 575	avg search = 1.56	(after resizing) = 1.28
> bins = 1024	inserts = 1189	avg search = 1.58	(after resizing) = 1.29
> bins = 2048	inserts = 2284	avg search = 1.56	(after resizing) = 1.28
>
> Suppose you wanted an average search of 1.6.  You get the following
> number of inserts (computed from average search depth formula):
>
> bins = 128	inserts = 154	avg search = 1.60	(after resizing) = 1.30
> bins = 256	inserts = 308	avg search = 1.60	(after resizing) = 1.30
> bins = 512	inserts = 615	avg search = 1.60	(after resizing) = 1.30
> bins = 1024	inserts = 1229	avg search = 1.60	(after resizing) = 1.30
> bins = 2048	inserts = 2458	avg search = 1.60	(after resizing) = 1.30
>
> This is the same thing, just a different way of looking at it.  Hence,
> you can approximate the behavior of table fullness using average
> search time.
>
> The maximum range that the fullness scheme can handle is 100%
> fullness, which produces the following number of insertions:
>
> bins = 128	inserts = 788	avg search = 4.07	(after resizing) = 2.54
> bins = 256	inserts = 1275	avg search = 3.49	(after resizing) = 2.24
> bins = 512	inserts = 4184	avg search = 5.08	(after resizing) = 3.04
> bins = 1024	inserts = 8765	avg search = 5.28	(after resizing) = 3.14
> bins = 2048	inserts = 26668	avg search = 7.51	(after resizing) = 4.26
>
> The fullness scheme has a nonlinear effect on how long it takes data
> to be extracted from the hash table.  This makes it difficult to
> predict the hash table's performance both before and after the resize.
>
> In contrast, the average search time allows you to define a constant
> performance level that is maintained as the table becomes larger.  For
> example, if you want to have an average search depth of 3, you can
> analytically compute how large your table should be and resize only
> when that constraint is violated, as shown below:
>
> bins = 128	inserts = 513	avg search = 3.00	(after resizing) = 2.00
> bins = 256	inserts = 1025	avg search = 3.00	(after resizing) = 2.00
> bins = 512	inserts = 2049	avg search = 3.00	(after resizing) = 2.00
> bins = 1024	inserts = 4097	avg search = 3.00	(after resizing) = 2.00
> bins = 2048	inserts = 8193	avg search = 3.00	(after resizing) = 2.00

Just looking at the numbers, the relationship is too clean: divide the inserts 
by the bins and you get roughly the average search depth. I don't believe it 
will come anywhere close to that in real life. Therefore I would define the 
goals first and derive the metrics from them. For example: I wish to save 
space and allow resizing only in certain cases, and only if the average search 
depth exceeds a certain value. I don't believe this can be derived in advance, 
since real-life distributions are not uniform.

> While a non-scientific argument, I would also suggest that average
> search depth should be the resizing metric because a programmer is
> more likely to ask "How long does it take to get data out of the hash
> table?" rather than "How full is the hash table?"

If that is compatible with your goal, sure; however, on a small device you are 
also concerned with space (especially if you have many hashtables).

> Of course, all these computations are subject to the standard
> assumptions on uniform random input and hashes.  Your results may
> vary, if you have different assumptions.

Yes, I believe they will vary for non-uniform distributions, which is what 
happens in real life. Therefore, I don't believe we can predict in advance, 
for all cases, the hashtable size that best trades off space against average 
search depth.

Btw, what happens when you employ your suggested method of moving the most 
recently used entries to the top of each chain? It could greatly improve 
things. I would suggest a more tuned plan: assign each entry a hit count, and 
each time you walk the chain, increment the count of the entry being 
retrieved. If its hit count becomes greater than that of its parent entry, 
swap it with the parent. Counter overflow can be handled by resetting the 
particular list, or any other method will do (such as dividing all counts by 
2^(maxbits of counter/2)).
-- 
Regards,
        Tzahi.
--
Tzahi Fadida
Blog: http://tzahi.blogsite.org | Home Site: http://tzahi.webhop.info
WARNING TO SPAMMERS:  see at 
http://members.lycos.co.uk/my2nis/spamwarning.html

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ


