On Sun, 12 Mar 2006, Junio C Hamano wrote:
>
> The code uses close to 16-bit hash and I had 65k flat array as a
> hashtable. That one was what you commented as "4-times as many
> page misses".

Ahh. That explains the limited bits in the hash function too. I only
looked at the current sources, not at the historic ones.

Btw, the page misses may come from the fact that you allocated and
re-allocated the flat array all the time. That can be very expensive for
big allocations, since most libraries may decide that it's a big enough
area that it should be map/unmap'ed in order to give memory back to the
system (without realizing that there's another allocation of the same
size coming soon afterwards).

If you map/unmap, the kernel will end up having to not just use new
pages, but obviously also clear them for security reasons. So it ends up
sucking on many levels.

In contrast, if you just have a 64k-entry array of "int" and allocate it
_once_ (instead of once per file-pair), you'll still have to clear it in
between file-pairs, but at least you won't have the overhead of mapping
and unmapping it.

The clearing can still be pretty expensive (64k "int" entries is 256kB,
and since most _files_ are just in the ~4-8kB range, you're spending a
_lot_ of time just memset'ing). Which is why it's probably a good idea
to instead default to having just "filesize / 8" entries, but then you
obviously can't use the hash as the index any more.

		Linus
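For illustration, a minimal sketch of the allocate-once scheme described
above (the names here are made up for the example, not git's actual
xdiff code):

	#include <stdlib.h>
	#include <string.h>

	#define HASH_BITS 16
	#define HASH_SIZE (1 << HASH_BITS)  /* 65536 entries, 256kB of ints */

	static int *hash_table;

	/* Allocate once, up front, instead of once per file-pair. */
	static int init_hash_table(void)
	{
		hash_table = malloc(HASH_SIZE * sizeof(int));
		return hash_table ? 0 : -1;
	}

	/*
	 * Between file-pairs we only clear the table: no free()/malloc(),
	 * so no map/unmap traffic and no kernel page-zeroing per pair.
	 */
	static void reset_hash_table(void)
	{
		memset(hash_table, 0, HASH_SIZE * sizeof(int));
	}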
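And a sketch of the "filesize / 8" sizing: round the entry count up to a
power of two, then mask the hash down to an index instead of using it
directly (again, hypothetical helpers, not git's actual code):

	/* Smallest power-of-two table with at least filesize / 8 entries. */
	static unsigned int table_bits(unsigned long filesize)
	{
		unsigned long entries = filesize / 8;
		unsigned int bits = 1;

		while ((1UL << bits) < entries)
			bits++;
		return bits;
	}

	/* The 16-bit hash is no longer a direct index; reduce it. */
	static unsigned int table_index(unsigned int hash, unsigned int bits)
	{
		return hash & ((1U << bits) - 1);
	}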