On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote: > On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote: > > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote: > > > I remember playing around with the elf symbol table for a browser > > > and all its shared libraries. > > > While the hash function is pretty trivial, it really didn't matter > > > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash > > > chains were always long. > > > > that's a pretty bad hash, even golden ratio hash would be better, but > > still bad; you really should be using at least jhash. > > There's a "fun" effect; essentially the "biased observer" effect which > leads students to erroneously conclude that the majority of classes are > oversubscribed. As somebody observed in this thread, for some usecases > you only look up hashes which actually exist. > > Task a trivial example where you have four entries unevenly distributed > between two buckets, three in one bucket and one in the other. Now 3/4 > of your lookups hit in one bucket and 1/4 in the other bucket. > Obviously it's not as pronounced if you have 1000 buckets with 1000 > entries randomly distributed between the buckets. But that distribution > is not nearly as even as you might expect: > > $ ./distrib > 0: 362 > 1: 371 > 2: 193 > 3: 57 > 4: 13 > 5: 4 > > That's using lrand48() to decide which bucket to use, so not even a > "quality of hash" problem, just a "your mathematical intuition may not > be right here". well, golden ratio hash - hash_32(i, 32) 0: 368 1: 264 2: 368 3: 0 but your distribution actually is accurate in general, golden ratio hash is relly nice for sequential integers. the actual problem with your test is that you're testing 100% occupancy - no one does that. 75% occupancy, siphash: 0: 933 1: 60 2: 6 3: 1 4: 0 that looks about right to me.