Re: Hashtable in the kernel

"Eddie Pettis" <pettis.eddie@xxxxxxxxx> · Sun, 22 Apr 2007 14:40:37 -0400

On 4/16/07, Tzahi Fadida <Tzahi.ML@xxxxxxxxx> wrote:
On Monday 16 April 2007 03:20:42 Eddie Pettis wrote:
> [snip]

I was referring to overall fullness of the data structure not to each key
level. The goal was to the space saved in memory compared to average key
search as you define it. However, in the real world, some of the data could
behave fine but there could be some spikes which is what we want to ignore.

Sorry I didn't understand before.

> [snip]

Just looking at the numbers, it is too straight. just divide the inserts by
bins and you get roughly the avg search. I don't believe it will even come
close to that in real life. Therefor i would define goals first and derive
the metrics from that. For example, i wish to save space and only on certain
cases allow resizing, but only if the avg search is over a certain value.
I don't believe this can be derived in advanced since real life are not
uniform.
It was a hacked together simulation, so I'm not surprised.

I had to change the hash function for my string-based hash table
because it wasn't very good at creating uniform random output.  After
fixing the hash function to be more ASCII friendly, my real world
numbers seem to be fairly balanced.  For ~40,000 keys hashed into 256
bins, I get the following statistics:

mean = 160
median = 159
standard deviation = 11.97
min = 129
max = 200

In another benchmark with ~522,000 keys hashed into 1024 bins,  I get
the following statistics:

mean = 510
median = 510
standard deviation = 19.55
min = 450
max = 578

As I said, these look pretty balanced.

> While a non-scientific argument, I would also suggest that average
> search depth should be the resizing metric because a programmer is
> more likely to ask "How long does it take to get data out of the hash
> table?" rather than "How full is the hash table?"

If it is compatible to your goal sure, however, if you have a small device,
your are also concerned with space (especially if you have many hashtables).
True.

> Of course, all these computations are subject to the standard
> assumptions on uniform random input and hashes.  Your results may
> vary, if you have different assumptions.

Yes, i believe they will vary for non-uniform distributions which is what
happens in real life, therefor, i don't believe we can predict in advance for
all cases the optimal space/average search hashtable size.

btw, what happens when you employ your suggested method of advancing to the
top most recently used entries for each key. It could widely improve.
I uploaded a new version to my website that allows the most recently
accessed keys to float to the top of the list.  Basically, it splices
the list out of the chain and places it on the head of the list after
being accessed.  I chose this technique because different keys are
more popular during different phases of execution.  When the
popularity changes, the keys are reshuffled quickly.

It works fairly well in practice.  With 39,936 entries in a 256 bin
table, I get an average search depth of 39 (expect ~78).  With 522,240
entries in a 1024 bin table, average search depth is 132 (expect
~255).

Dynamic resizing would definitely help because the 1024 bin table is
seriously overloaded and causes a major performance drag, even on our
lab's new 2 CPU, 8-core server.

I suggest perhaps a more tuned plan, where is, you assign each entry a hit
count, then, when you pass it over the chain you can increase an entry
retrieved. If the hit count is greater than the hit count of the parent entry
you do a switch with the parent. You can handle corner cases by resetting the
particular list or any other method will do (like dividing by 2^(maxbits of
counter/2)).
For stable applications, the hit count would be very effective.  When
applications are dynamic, such as my case, I think it would take too
long to re-sort the table to get the benefit.  That is, by time the
hit count increases, the program has already moved on to a different
execution phase.  A decay value would help, but it would also require
traversing the entire table to apply it.

--

Eddie Pettis
Electrical and Computer Engineering
Purdue University

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ