Andrew Haley writes: > Christian Thalinger writes: > > Hi! > > > > We had a curious problem with SPECjvm98 _213_javac and cacao. A newly > > recompiled cacao ended with 10% preformance degradation. After further > > investigation we ended with VMSystem.identityHashCode() and it's return > > value. > > > > _213_javac make heavy use of Hashtable's and the objects stored in the > > hashtable use VMSystem.identityHashCode() as their hash function. Since > > alignment is normally 8 or 16-bytes, the distribution of the objects > > seems to be not optimal. We ended with 40000 Object.equals() calls more > > and a doubled runtime for Hashtable.get() (measured clock cycles), as > > Object.equals() is used in Hashtable.get(). > > This seems a bit strange. Hashtable.rehash() seems always to generate > a new capacity based on the old capacity * 2 + 1, so you wouldn't > expect to see a great many collisions as the definition of hash() is > > key.hashCode() % buckets.length > > Okay, buckets.length isn't necessarily prime, but you wouldn't expect > this to have a very non-uniform distribution. > > > A shifting of the VMSystem.identityHashCode() pointer value helps > > (sometimes), but shifting by 3 makes it even worse (what was my guess to > > make it better). > > > > Any thoughts? Should we change Hashtable.hash()? Or return a better > > value for VMSystem.identityHashCode()? But most runtimes only return > > the memory pointer... > > There are things that we can try, such a using a better way to turn an > address into a hash. But it would be nice first to know the real > cause of the very non-uniform distribution, if indeed that's what is > happening. A bit more thinking: the hash table size is guranteed never to be a multiple of a power of two, so the fact that the addresses always are multiples of power of two shouldn't matter. After a bit of experimenting, I get 1651956 collisions per 10716023 probes (about 16%) which is not far from what Knuth expects for an unsuccessful lookup with a truly random hash function and a half-full table. This percentage didn't improve when I used a scrambling function to do the hashing: in fact it got slightly worse, but that might be in the noise. unsigned long addr = (unsigned long)obj; unsigned long rotate = (addr >> 5 | addr << 27); return (addr ^ rotate) & 0x7fffffff; So, Twisti: what proportion of hash table probles generate collisions when you do the experiment? Andrew.