On 4/23/07, Eddie Pettis <pettis.eddie@xxxxxxxxx> wrote:
On 4/16/07, Tzahi Fadida <Tzahi.ML@xxxxxxxxx> wrote:
> On Monday 16 April 2007 03:20:42 Eddie Pettis wrote:
> > [snip]
>
> I was referring to the overall fullness of the data structure, not to each
> key level. The goal was to trade off the space saved in memory against the
> average key search as you define it. However, in the real world, some of
> the data could behave fine but there could be some spikes, which is what
> we want to ignore. Sorry I didn't understand before.
>
> > [snip]
>
> Just looking at the numbers, it is too clean: just divide the inserts by
> the bins and you get roughly the average search. I don't believe it will
> even come close to that in real life. Therefore I would define goals first
> and derive the metrics from that. For example, I wish to save space and
> only in certain cases allow resizing, but only if the average search is
> over a certain value. I don't believe this can be derived in advance,
> since real-life data are not uniform.

It was a hacked-together simulation, so I'm not surprised. I had to change
the hash function for my string-based hash table because it wasn't very
good at creating uniform random output. After fixing the hash function to
be more ASCII friendly, my real-world numbers seem to be fairly balanced.
For ~40,000 keys hashed into 256 bins, I get the following statistics:

  mean = 160
  median = 159
  standard deviation = 11.97
  min = 129
  max = 200

In another benchmark with ~522,000 keys hashed into 1024 bins, I get the
following statistics:

  mean = 510
  median = 510
  standard deviation = 19.55
  min = 450
  max = 578

As I said, these look pretty balanced.

> > While a non-scientific argument, I would also suggest that average
> > search depth should be the resizing metric because a programmer is
> > more likely to ask "How long does it take to get data out of the hash
> > table?" rather than "How full is the hash table?"
>
> If it is compatible with your goal, sure; however, if you have a small
> device, you are also concerned with space (especially if you have many
> hash tables).

True.

> > Of course, all these computations are subject to the standard
> > assumptions on uniform random input and hashes. Your results may
> > vary, if you have different assumptions.
>
> Yes, I believe they will vary for non-uniform distributions, which is
> what happens in real life; therefore, I don't believe we can predict in
> advance the optimal space/average-search hash table size for all cases.
>
> By the way, what happens when you employ your suggested method of
> advancing the most recently used entries to the top for each key? It
> could improve things considerably.

I uploaded a new version to my website that allows the most recently
accessed keys to float to the top of the list. Basically, it splices the
entry out of the chain and places it at the head of the list after being
accessed. I chose this technique because different keys are more popular
during different phases of execution. When the popularity changes, the keys
are reshuffled quickly.

It works fairly well in practice. With 39,936 entries in a 256 bin table, I
get an average search depth of 39 (expect ~78, i.e. about half the mean
chain length of 39,936/256 = 156). With 522,240 entries in a 1024 bin
table, the average search depth is 132 (expect ~255, half of
522,240/1024 = 510). Dynamic resizing would definitely help because the
1024 bin table is seriously overloaded and causes a major performance drag,
even on our lab's new 2 CPU, 8-core server.
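(For concreteness, the splice described above amounts to something like the
sketch below. The structure layout, names, and hash are purely illustrative
and assume one singly linked chain per bin; the actual code on Eddie's site
may well differ.)

#include <string.h>

struct entry {
        const char   *key;
        void         *value;
        struct entry *next;
};

struct htable {
        struct entry **bins;    /* nbins chain heads */
        unsigned int   nbins;
};

static unsigned int hash_str(const char *s, unsigned int nbins)
{
        unsigned int h = 5381;  /* djb2-style string hash, ASCII friendly */

        while (*s)
                h = h * 33 + (unsigned char)*s++;
        return h % nbins;
}

/* Look up a key; on a hit, splice the entry out of its chain and
 * re-insert it at the head of the bin (move-to-front). */
void *htable_lookup(struct htable *t, const char *key)
{
        unsigned int   bin  = hash_str(key, t->nbins);
        struct entry **prev = &t->bins[bin];
        struct entry  *e;

        for (e = *prev; e != NULL; prev = &e->next, e = e->next) {
                if (strcmp(e->key, key) != 0)
                        continue;
                *prev = e->next;          /* splice out of the chain  */
                e->next = t->bins[bin];   /* push onto the bin's head */
                t->bins[bin] = e;
                return e->value;
        }
        return NULL;
}

The nice property is that a repeated lookup of the same key costs a single
strcmp() at the head of the chain, and a change in key popularity reshuffles
the chain after one access to each newly hot key.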
> I suggest perhaps a more tuned plan, that is, you assign each entry a hit
> count; then, when you pass over the chain, you can increment the count of
> the entry retrieved. If the hit count is greater than the hit count of the
> parent entry, you do a switch with the parent. You can handle corner cases
> by resetting the particular list, or any other method will do (like
> dividing by 2^(maxbits of counter/2)).

For stable applications, the hit count would be very effective. When
applications are dynamic, such as my case, I think it would take too long
to re-sort the table to get the benefit. That is, by the time the hit count
increases, the program has already moved on to a different execution phase.
A decay value would help, but it would also require traversing the entire
table to apply it.

--
Eddie Pettis
Electrical and Computer Engineering
Purdue University
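(Tzahi's hit-count scheme, again as a rough sketch using the same
illustrative chain layout as the earlier example: each entry carries a
counter, a hit promotes the entry one position past its parent when its
count is higher, and a counter wrap scales the chain's counts down along
the lines of the "divide by 2^(bits/2)" suggestion.)

#include <limits.h>
#include <string.h>

struct hentry {
        const char    *key;
        void          *value;
        unsigned int   hits;
        struct hentry *next;
};

/* Look up a key in one bin's chain.  On a hit, bump the entry's counter
 * and, if it is now hotter than its predecessor ("parent"), swap the two.
 * On counter wrap-around, scale every count in the chain down. */
void *chain_lookup_hits(struct hentry **bin, const char *key)
{
        struct hentry **pparent = NULL;   /* link that points at the parent  */
        struct hentry **pcur    = bin;    /* link that points at the current */
        struct hentry  *e;

        for (e = *pcur; e != NULL; pparent = pcur, pcur = &e->next, e = e->next) {
                if (strcmp(e->key, key) != 0)
                        continue;

                if (++e->hits == 0) {                   /* counter wrapped */
                        struct hentry *p;

                        for (p = *bin; p != NULL; p = p->next)
                                p->hits >>= sizeof(p->hits) * CHAR_BIT / 2;
                        e->hits = 1;
                }

                if (pparent != NULL && e->hits > (*pparent)->hits) {
                        struct hentry *parent = *pparent;

                        parent->next = e->next;     /* parent now follows e   */
                        e->next      = parent;
                        *pparent     = e;           /* e takes parent's place */
                }
                return e->value;
        }
        return NULL;
}

As Eddie notes, the counters only pay off when popularity is stable, and a
decay pass like the shift loop above has to walk an entire chain, which is
exactly the cost he is worried about.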
> I uploaded a new version to my website that allows the most recently
> accessed keys to float to the top of the list. Basically, it splices the
> entry out of the chain and places it at the head of the list after being
> accessed. I chose this technique because different keys are more popular
> during different phases of execution. When the popularity changes, the
> keys are reshuffled quickly.
How about using splay trees? They do exactly the same thing, i.e. recently
accessed elements are pushed to the top, and the basic operations
(insertion, look-up, and removal) run in O(log n) amortized time. I don't
know how you have implemented it using lists, but using splay trees
minimizes the bookkeeping as well.

--
Regards,
Sandeep.
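(For reference, the splaying step itself is fairly compact. Below is a
sketch of the classic top-down splay from Sleator and Tarjan, with plain
integer keys for brevity; it is not wired into Eddie's hash table, just an
illustration of how a lookup pulls the accessed key to the root.)

#include <stddef.h>

struct node {
        int          key;
        struct node *left, *right;
};

/* Top-down splay: reorganize the tree so that the node holding `key`
 * (or the last node on the search path, if the key is absent) becomes
 * the root. */
static struct node *splay(struct node *root, int key)
{
        struct node  header = { 0, NULL, NULL };
        struct node *l = &header, *r = &header, *tmp;

        if (root == NULL)
                return NULL;

        for (;;) {
                if (key < root->key) {
                        if (root->left == NULL)
                                break;
                        if (key < root->left->key) {    /* rotate right */
                                tmp = root->left;
                                root->left = tmp->right;
                                tmp->right = root;
                                root = tmp;
                                if (root->left == NULL)
                                        break;
                        }
                        r->left = root;                 /* link right */
                        r = root;
                        root = root->left;
                } else if (key > root->key) {
                        if (root->right == NULL)
                                break;
                        if (key > root->right->key) {   /* rotate left */
                                tmp = root->right;
                                root->right = tmp->left;
                                tmp->left = root;
                                root = tmp;
                                if (root->right == NULL)
                                        break;
                        }
                        l->right = root;                /* link left */
                        l = root;
                        root = root->right;
                } else {
                        break;
                }
        }
        l->right = root->left;                          /* reassemble */
        r->left  = root->right;
        root->left  = header.right;
        root->right = header.left;
        return root;
}

/* Lookup: splay the key to the root, then check whether it is there. */
struct node *splay_lookup(struct node **rootp, int key)
{
        *rootp = splay(*rootp, key);
        return (*rootp && (*rootp)->key == key) ? *rootp : NULL;
}

Note that the O(log n) bound is amortized: a single access can still be
linear in the worst case, but any sequence of m accesses costs O(m log n).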