On Monday 16 April 2007 03:20:42 Eddie Pettis wrote:
> > My main concern and reason for percentage is that what happens if the
> > keys of the entries in the buckets are not differentiated even in the
> > next rebuild or 2 rebuilds.
>
> That's just a consequence of hashing in general. Well-designed
> hashing algorithms are designed to generate a uniform random hash.
> You can construct a malicious set of keys that create pathologically
> bad behavior, but that scenario is unlikely in practice.
>
> In short, I wouldn't worry about this problem because you will
> _on_average_ improve search depth by resizing.

Depends on your goals; see below.

> > That could prove very wasteful, thus, the percentage is meant as
> > a barrier to prevent too many rebuilds, which could waste space and
> > lower performance. You can still use other metrics on top of it.
> > I.e., you can set it at 50% and use average search depth, etc...
>
> I understand that you don't want to be rebuilding constantly. I ran a
> few little experiments comparing the two metrics of percent fullness
> and average search depth.
>
> Suppose you wanted the 67% full metric at the hlist_head level. Then,
> you will need to resize after approximately the following number of
> inserts (computed using a quick MATLAB simulation with the average of
> 64 runs for each number of bins):

I was referring to the overall fullness of the data structure, not to
the fullness at each key level. The goal was to weigh the space saved
in memory against the average key search, as you define it. However,
in the real world, most of the data could behave fine while a few
buckets show spikes, and those spikes are what we want to ignore.
> bins = 128   inserts = 171    avg search = 1.66  (after resizing) = 1.33
> bins = 256   inserts = 286    avg search = 1.56  (after resizing) = 1.28
> bins = 512   inserts = 575    avg search = 1.56  (after resizing) = 1.28
> bins = 1024  inserts = 1189   avg search = 1.58  (after resizing) = 1.29
> bins = 2048  inserts = 2284   avg search = 1.56  (after resizing) = 1.28
>
> Suppose you wanted an average search of 1.6. You get the following
> number of inserts (computed from average search depth formula):
>
> bins = 128   inserts = 154    avg search = 1.60  (after resizing) = 1.30
> bins = 256   inserts = 308    avg search = 1.60  (after resizing) = 1.30
> bins = 512   inserts = 615    avg search = 1.60  (after resizing) = 1.30
> bins = 1024  inserts = 1229   avg search = 1.60  (after resizing) = 1.30
> bins = 2048  inserts = 2458   avg search = 1.60  (after resizing) = 1.30
>
> This is the same thing, just a different way of looking at it. Hence,
> you can approximate the behavior of table fullness using average
> search time.
>
> The maximum range that the fullness scheme can handle is 100%
> fullness, which produces the following number of insertions:
>
> bins = 128   inserts = 788    avg search = 4.07  (after resizing) = 2.54
> bins = 256   inserts = 1275   avg search = 3.49  (after resizing) = 2.24
> bins = 512   inserts = 4184   avg search = 5.08  (after resizing) = 3.04
> bins = 1024  inserts = 8765   avg search = 5.28  (after resizing) = 3.14
> bins = 2048  inserts = 26668  avg search = 7.51  (after resizing) = 4.26
>
> The fullness scheme has a nonlinear effect on how long it takes data
> to be extracted from the hash table. This makes it difficult to
> predict the hash table's performance both before and after the resize.
>
> In contrast, the average search time allows you to define a constant
> performance level that is maintained as the table becomes larger.
> For example, if you want to have an average search depth of 3, you
> can analytically compute how large your table should be and resize
> only when that constraint is violated, as shown below:
>
> bins = 128   inserts = 513    avg search = 3.00  (after resizing) = 2.00
> bins = 256   inserts = 1025   avg search = 3.00  (after resizing) = 2.00
> bins = 512   inserts = 2049   avg search = 3.00  (after resizing) = 2.00
> bins = 1024  inserts = 4097   avg search = 3.00  (after resizing) = 2.00
> bins = 2048  inserts = 8193   avg search = 3.00  (after resizing) = 2.00

Just looking at the numbers, they are too neat: divide the inserts by
the bins and you get roughly the average search depth. I don't believe
real workloads will come close to that. Therefore, I would define the
goals first and derive the metrics from them. For example: I wish to
save space and allow resizing only in certain cases, and only if the
average search depth exceeds a certain value. I don't believe the
optimal size can be derived in advance, since real-life distributions
are not uniform.

> While a non-scientific argument, I would also suggest that average
> search depth should be the resizing metric because a programmer is
> more likely to ask "How long does it take to get data out of the hash
> table?" rather than "How full is the hash table?"

If that is compatible with your goal, sure; however, on a small device
you are also concerned with space (especially if you have many hash
tables).

> Of course, all these computations are subject to the standard
> assumptions on uniform random input and hashes. Your results may
> vary, if you have different assumptions.

Yes, I believe they will vary for the non-uniform distributions that
occur in real life; therefore, I don't believe we can predict in
advance, for all cases, the hash table size with the optimal
space/average-search trade-off.

By the way, what happens when you employ your suggested method of
advancing the most recently used entries to the top of each key's
chain? It could improve things considerably.
I suggest perhaps a more tuned plan: assign each entry a hit count,
and when you walk the chain you increment the count of the entry you
retrieve. If its hit count becomes greater than the hit count of the
parent entry, you swap it with the parent. You can handle corner cases
(e.g. counter overflow) by resetting the particular list's counts, or
any other method will do (like dividing by 2^(maxbits of counter/2)).

--
Regards,
        Tzahi.
--
Tzahi Fadida
Blog: http://tzahi.blogsite.org | Home Site: http://tzahi.webhop.info
WARNING TO SPAMMERS: see at
http://members.lycos.co.uk/my2nis/spamwarning.html

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ