On 4/23/07, Eddie Pettis <pettis.eddie@xxxxxxxxx> wrote:
On 4/16/07, Tzahi Fadida <Tzahi.ML@xxxxxxxxx> wrote:
> On Monday 16 April 2007 03:20:42 Eddie Pettis wrote:
> > [snip]
>
> I was referring to the overall fullness of the data structure, not to each
> key level. The goal was to trade off the space saved in memory against the
> average key search as you define it. However, in the real world, some of
> the data could behave fine but there could be some spikes, which is what
> we want to ignore. Sorry I didn't understand before.
>
> > [snip]
>
> Just looking at the numbers, it is too clean: just divide the inserts by
> the bins and you get roughly the average search. I don't believe it will
> even come close to that in real life. Therefore I would define goals first
> and derive the metrics from that. For example, I wish to save space and
> only in certain cases allow resizing, but only if the average search is
> over a certain value. I don't believe this can be derived in advance,
> since real-life data are not uniform.

It was a hacked-together simulation, so I'm not surprised. I had to change
the hash function for my string-based hash table because it wasn't very
good at creating uniform random output. After fixing the hash function to
be more ASCII friendly, my real-world numbers seem to be fairly balanced.
For ~40,000 keys hashed into 256 bins, I get the following statistics:

  mean = 160
  median = 159
  standard deviation = 11.97
  min = 129
  max = 200

In another benchmark with ~522,000 keys hashed into 1024 bins, I get the
following statistics:

  mean = 510
  median = 510
  standard deviation = 19.55
  min = 450
  max = 578

As I said, these look pretty balanced.

> > While a non-scientific argument, I would also suggest that average
> > search depth should be the resizing metric because a programmer is
> > more likely to ask "How long does it take to get data out of the hash
> > table?" rather than "How full is the hash table?"
>
> If it is compatible with your goal, sure; however, if you have a small
> device, you are also concerned with space (especially if you have many
> hash tables).

True.

> > Of course, all these computations are subject to the standard
> > assumptions on uniform random input and hashes. Your results may
> > vary, if you have different assumptions.
>
> Yes, I believe they will vary for non-uniform distributions, which is
> what happens in real life; therefore, I don't believe we can predict in
> advance the optimal space/average-search hash table size for all cases.
>
> By the way, what happens when you employ your suggested method of
> advancing the most recently used entries to the top for each key? It
> could improve things considerably.

I uploaded a new version to my website that allows the most recently
accessed keys to float to the top of the list. Basically, it splices the
entry out of the chain and places it at the head of the list after being
accessed. I chose this technique because different keys are more popular
during different phases of execution. When the popularity changes, the keys
are reshuffled quickly.

It works fairly well in practice. With 39,936 entries in a 256 bin table, I
get an average search depth of 39 (expect ~78, i.e. about half the mean
chain length of 39,936/256 = 156). With 522,240 entries in a 1024 bin
table, the average search depth is 132 (expect ~255, half of
522,240/1024 = 510). Dynamic resizing would definitely help because the
1024 bin table is seriously overloaded and causes a major performance drag,
even on our lab's new 2 CPU, 8-core server.
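(For concreteness, the splice described above amounts to something like the
sketch below. The structure layout, names, and hash are purely illustrative
and assume one singly linked chain per bin; the actual code on Eddie's site
may well differ.)

#include <string.h>

struct entry {
        const char   *key;
        void         *value;
        struct entry *next;
};

struct htable {
        struct entry **bins;    /* nbins chain heads */
        unsigned int   nbins;
};

static unsigned int hash_str(const char *s, unsigned int nbins)
{
        unsigned int h = 5381;  /* djb2-style string hash, ASCII friendly */

        while (*s)
                h = h * 33 + (unsigned char)*s++;
        return h % nbins;
}

/* Look up a key; on a hit, splice the entry out of its chain and
 * re-insert it at the head of the bin (move-to-front). */
void *htable_lookup(struct htable *t, const char *key)
{
        unsigned int   bin  = hash_str(key, t->nbins);
        struct entry **prev = &t->bins[bin];
        struct entry  *e;

        for (e = *prev; e != NULL; prev = &e->next, e = e->next) {
                if (strcmp(e->key, key) != 0)
                        continue;
                *prev = e->next;          /* splice out of the chain  */
                e->next = t->bins[bin];   /* push onto the bin's head */
                t->bins[bin] = e;
                return e->value;
        }
        return NULL;
}

The nice property is that a repeated lookup of the same key costs a single
strcmp() at the head of the chain, and a change in key popularity reshuffles
the chain after one access to each newly hot key.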
> I suggest perhaps a more tuned plan, that is, you assign each entry a hit
> count; then, when you pass over the chain, you can increment the count of
> the entry retrieved. If the hit count is greater than the hit count of the
> parent entry, you do a switch with the parent. You can handle corner cases
> by resetting the particular list, or any other method will do (like
> dividing by 2^(maxbits of counter/2)).

For stable applications, the hit count would be very effective. When
applications are dynamic, such as my case, I think it would take too long
to re-sort the table to get the benefit. That is, by the time the hit count
increases, the program has already moved on to a different execution phase.
A decay value would help, but it would also require traversing the entire
table to apply it.

--
Eddie Pettis
Electrical and Computer Engineering
Purdue University
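(Tzahi's hit-count scheme, again as a rough sketch using the same
illustrative chain layout as the earlier example: each entry carries a
counter, a hit promotes the entry one position past its parent when its
count is higher, and a counter wrap scales the chain's counts down along
the lines of the "divide by 2^(bits/2)" suggestion.)

#include <limits.h>
#include <string.h>

struct hentry {
        const char    *key;
        void          *value;
        unsigned int   hits;
        struct hentry *next;
};

/* Look up a key in one bin's chain.  On a hit, bump the entry's counter
 * and, if it is now hotter than its predecessor ("parent"), swap the two.
 * On counter wrap-around, scale every count in the chain down. */
void *chain_lookup_hits(struct hentry **bin, const char *key)
{
        struct hentry **pparent = NULL;   /* link that points at the parent  */
        struct hentry **pcur    = bin;    /* link that points at the current */
        struct hentry  *e;

        for (e = *pcur; e != NULL; pparent = pcur, pcur = &e->next, e = e->next) {
                if (strcmp(e->key, key) != 0)
                        continue;

                if (++e->hits == 0) {                   /* counter wrapped */
                        struct hentry *p;

                        for (p = *bin; p != NULL; p = p->next)
                                p->hits >>= sizeof(p->hits) * CHAR_BIT / 2;
                        e->hits = 1;
                }

                if (pparent != NULL && e->hits > (*pparent)->hits) {
                        struct hentry *parent = *pparent;

                        parent->next = e->next;     /* parent now follows e   */
                        e->next      = parent;
                        *pparent     = e;           /* e takes parent's place */
                }
                return e->value;
        }
        return NULL;
}

As Eddie notes, the counters only pay off when popularity is stable, and a
decay pass like the shift loop above has to walk an entire chain, which is
exactly the cost he is worried about.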
> I uploaded a new version to my website that allows the most recently
> accessed keys to float to the top of the list. Basically, it splices the
> entry out of the chain and places it at the head of the list after being
> accessed. I chose this technique because different keys are more popular
> during different phases of execution. When the popularity changes, the
> keys are reshuffled quickly.
How about using splay trees? They do exactly the same thing, i.e. recently
accessed elements are pushed to the top, and the basic operations
(insertion, look-up, and removal) run in O(log n) amortized time. I don't
know how you have implemented it using lists, but using splay trees
minimizes the bookkeeping as well.

--
Regards,
Sandeep.
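(For reference, the splaying step itself is fairly compact. Below is a
sketch of the classic top-down splay from Sleator and Tarjan, with plain
integer keys for brevity; it is not wired into Eddie's hash table, just an
illustration of how a lookup pulls the accessed key to the root.)

#include <stddef.h>

struct node {
        int          key;
        struct node *left, *right;
};

/* Top-down splay: reorganize the tree so that the node holding `key`
 * (or the last node on the search path, if the key is absent) becomes
 * the root. */
static struct node *splay(struct node *root, int key)
{
        struct node  header = { 0, NULL, NULL };
        struct node *l = &header, *r = &header, *tmp;

        if (root == NULL)
                return NULL;

        for (;;) {
                if (key < root->key) {
                        if (root->left == NULL)
                                break;
                        if (key < root->left->key) {    /* rotate right */
                                tmp = root->left;
                                root->left = tmp->right;
                                tmp->right = root;
                                root = tmp;
                                if (root->left == NULL)
                                        break;
                        }
                        r->left = root;                 /* link right */
                        r = root;
                        root = root->left;
                } else if (key > root->key) {
                        if (root->right == NULL)
                                break;
                        if (key > root->right->key) {   /* rotate left */
                                tmp = root->right;
                                root->right = tmp->left;
                                tmp->left = root;
                                root = tmp;
                                if (root->right == NULL)
                                        break;
                        }
                        l->right = root;                /* link left */
                        l = root;
                        root = root->right;
                } else {
                        break;
                }
        }
        l->right = root->left;                          /* reassemble */
        r->left  = root->right;
        root->left  = header.right;
        root->right = header.left;
        return root;
}

/* Lookup: splay the key to the root, then check whether it is there. */
struct node *splay_lookup(struct node **rootp, int key)
{
        *rootp = splay(*rootp, key);
        return (*rootp && (*rootp)->key == key) ? *rootp : NULL;
}

Note that the O(log n) bound is amortized: a single access can still be
linear in the worst case, but any sequence of m accesses costs O(m log n).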