On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote:
> On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> > I remember playing around with the elf symbol table for a browser
> > and all its shared libraries.
> > While the hash function is pretty trivial, it really didn't matter
> > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > chains were always long.
>
> that's a pretty bad hash, even golden ratio hash would be better, but
> still bad; you really should be using at least jhash.

There's a "fun" effect; essentially the "biased observer" effect which
leads students to erroneously conclude that the majority of classes are
oversubscribed.  As somebody observed in this thread, for some usecases
you only look up hashes which actually exist.

Take a trivial example where you have four entries unevenly distributed
between two buckets, three in one bucket and one in the other.  Now 3/4
of your lookups hit in one bucket and 1/4 in the other bucket.
Obviously it's not as pronounced if you have 1000 buckets with 1000
entries randomly distributed between the buckets.  But that distribution
is not nearly as even as you might expect:

$ ./distrib
0: 362
1: 371
2: 193
3: 57
4: 13
5: 4

That's using lrand48() to decide which bucket to use, so not even a
"quality of hash" problem, just a "your mathematical intuition may not
be right here".

To put this data another way, 371 entries are in a bucket with a single
entry, 386 are in a bucket with two entries, 171 are in a 3-entry
bucket, 52 are in a 4-entry bucket and 20 are in a 5-entry bucket.

$ cat distrib.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int bucket[1000];	/* bucket[b] = number of entries that landed in bucket b */
int freq[10];		/* freq[k] = number of buckets holding exactly k entries */

int main(int argc, char **argv)
{
	int i;

	/* Drop 1000 entries into 1000 buckets at random. */
	for (i = 0; i < 1000; i++)
		bucket[lrand48() % 1000]++;

	/* Build the histogram of bucket occupancy. */
	for (i = 0; i < 1000; i++)
		freq[bucket[i]]++;

	for (i = 0; i < 10; i++)
		printf("%d: %d\n", i, freq[i]);

	return 0;
}

(ok, quibble about "well, 1000 doesn't divide INT_MAX evenly so your
random number generation is biased", but I maintain that will not
materially affect these results due to it affecting only 0.00003% of
numbers generated)
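
The "biased observer" view can be computed directly from the histogram
that distrib prints.  What follows is only a minimal sketch, assuming
the same freq[] layout as distrib.c (freq[k] is the number of buckets
holding exactly k entries).  The helper names entries_in_k_buckets(),
mean_bucket_occupancy(), mean_chain_seen_by_lookup() and
print_lookup_view() are made up for illustration; they are not from
the kernel or any library.

```c
#include <assert.h>
#include <stdio.h>

/* Entries that live in buckets holding exactly k entries (hypothetical helper). */
long entries_in_k_buckets(const int *freq, int k)
{
	return (long)k * freq[k];
}

/* Mean entries per bucket, averaged over all buckets including empty ones. */
double mean_bucket_occupancy(const int *freq, int nfreq)
{
	long buckets = 0, entries = 0;
	int k;

	for (k = 0; k < nfreq; k++) {
		buckets += freq[k];
		entries += entries_in_k_buckets(freq, k);
	}
	assert(buckets > 0);
	return (double)entries / buckets;
}

/*
 * Mean chain length seen by a lookup of an existing key, assuming every
 * stored key is looked up equally often.  Each of the k entries in a
 * k-entry bucket sees a chain of length k, so the result is
 * sum(k * k * freq[k]) / sum(k * freq[k]).
 */
double mean_chain_seen_by_lookup(const int *freq, int nfreq)
{
	long entries = 0, weighted = 0;
	int k;

	for (k = 0; k < nfreq; k++) {
		entries += entries_in_k_buckets(freq, k);
		weighted += (long)k * entries_in_k_buckets(freq, k);
	}
	assert(entries > 0);
	return (double)weighted / entries;
}

/* Print the per-entry view of the histogram followed by both averages. */
void print_lookup_view(const int *freq, int nfreq)
{
	int k;

	for (k = 1; k < nfreq; k++)
		printf("%d-entry buckets hold %ld entries\n",
		       k, entries_in_k_buckets(freq, k));
	printf("mean entries per bucket: %.3f\n",
	       mean_bucket_occupancy(freq, nfreq));
	printf("mean chain length seen by a lookup: %.3f\n",
	       mean_chain_seen_by_lookup(freq, nfreq));
}
```

Calling print_lookup_view(freq, 10) at the end of main() in distrib.c
prints both views.  For the run shown above the mean occupancy is 1.0
entries per bucket, but a lookup of an existing key lands in a chain of
1.964 entries on average, close to the 2.0 you would expect if bucket
occupancy followed a Poisson distribution with mean 1.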