On Sun, 12 Mar 2006, Junio C Hamano wrote:
>
> The code uses close to 16-bit hash and I had 65k flat array as a
> hashtable. That one was what you commented as "4-times as many
> page misses".

Ahh. That explains the limited bits in the hash function too. I only
looked at the current sources, not at the historic ones.

Btw, the page misses may come from the fact that you allocated and
re-allocated the flat array all the time. That can be very expensive for
big allocations, since most libraries may decide that it's a big enough
area that it should be map/unmap'ed in order to give memory back to the
system (without realizing that there's another allocation of the same
size coming soon afterwards).

If you map/unmap, the kernel will end up having to not just use new
pages, but obviously also clear them for security reasons. So it ends up
sucking on many levels.

In contrast, if you just have a 64k-entry array of "int" and allocate it
_once_ (instead of once per file-pair), you'll still have to clear it in
between file-pairs, but at least you won't have the overhead of mapping
and unmapping it.

The clearing can still be pretty expensive (64k "int" entries is 256kB,
and since most _files_ are just in the ~4-8kB range, you're spending a
_lot_ of time just memset'ing). Which is why it's probably a good idea
to instead default to having just "filesize / 8" entries, but then you
obviously can't use the hash as the index any more.

		Linus
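For illustration, a minimal sketch of the allocate-once scheme described
above (the names here are made up for the example, not git's actual
xdiff code):

	#include <stdlib.h>
	#include <string.h>

	#define HASH_BITS 16
	#define HASH_SIZE (1 << HASH_BITS)  /* 65536 entries, 256kB of ints */

	static int *hash_table;

	/* Allocate once, up front, instead of once per file-pair. */
	static int init_hash_table(void)
	{
		hash_table = malloc(HASH_SIZE * sizeof(int));
		return hash_table ? 0 : -1;
	}

	/*
	 * Between file-pairs we only clear the table: no free()/malloc(),
	 * so no map/unmap traffic and no kernel page-zeroing per pair.
	 */
	static void reset_hash_table(void)
	{
		memset(hash_table, 0, HASH_SIZE * sizeof(int));
	}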
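And a sketch of the "filesize / 8" sizing: round the entry count up to a
power of two, then mask the hash down to an index instead of using it
directly (again, hypothetical helpers, not git's actual code):

	/* Smallest power-of-two table with at least filesize / 8 entries. */
	static unsigned int table_bits(unsigned long filesize)
	{
		unsigned long entries = filesize / 8;
		unsigned int bits = 1;

		while ((1UL << bits) < entries)
			bits++;
		return bits;
	}

	/* The 16-bit hash is no longer a direct index; reduce it. */
	static unsigned int table_index(unsigned int hash, unsigned int bits)
	{
		return hash & ((1U << bits) - 1);
	}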