Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> · Fri, 7 Apr 2017 09:51:40 -0400

On 4/7/2017 1:19 AM, Jeff King wrote:
On Thu, Apr 06, 2017 at 03:48:07PM -0700, Stefan Beller wrote:

On Thu, Apr 6, 2017 at 1:37 PM,  <git@xxxxxxxxxxxxxxxxx> wrote:
From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>

Teach traverse_trees_recursive() to not do redundant ODB
lookups when both directories refer to the same OID.

And the reason for this is that omitting the second lookup
saves time, i.e. a lookup in the ODB of a sufficiently large
repo is slow.

My kneejerk line of thinking:
* yes, it sounds good to reduce the number of ODB accesses.
* But if we consider ODB lookups to be slow and we perform
  a structured access, how about a cache in front of the ODB?
* We already have that! (sort of..) 9a414486d9 (lookup_object:
  prioritize recently found objects, 2013-05-01)
* Instead of improving the caching, maybe change the
  size of the problem: We could keep the objects of different types
  in different hash-tables.

I don't think this is using lookup_object() at all, though. It goes
straight to fill_tree_descriptor(), which will read the object contents
from disk. So our time is going to:

  1. Finding the object in the odb (ideally a binary search in a single
     pack index, but less optimal when there are many packs).

  2. Reconstructing the object. This means zlib-inflating from disk, but
     it may also mean delta reconstruction.

I think there _are_ some caches at play here, though, when you look up
the same tree back to back. The pack-mru means that we'll always look in
the correct pack first. And in theory the delta base cache means that
we'll already have the whole thing reconstructed in memory (though I
have often been confused by exactly when we put items into that cache,
so I might be wrong).

This change shaved 500K calls to fill_tree_descriptor() (on the Windows
source tree on a "checkout -b" to the same commit).  I was surprised it
didn't give more of a speed up, so some of that caching may be in play
here, but it's hard to tell.

Also, on my repo I have 100GB+ of packfiles, so that may be messing
with things a bit.

So in theory, this is not all that different than the "just allocate and
copy the bytes" optimization that's happening here (though I'm not
surprised that doing it at a higher level can produce some speedup).

I think the more interesting optimization is "just use the same buffer
without bothering to copy", which is hard for the low-level code to do
(since it doesn't know what lifetime the caller is expecting).

My first draft did just borrow the buffer right there in traverse_
and it seemed to work just fine.  I was being cautious and copied it
properly in the lower layer.  Since the bufs are freed at the bottom
it felt safe, but I didn't want to overlook anything.  I'll switch it
back.

object.c has its own hash table, I presume for historical and
performance reasons, this would be split up to multiple hash
tables.

So I don't think lookup_object() is really relevant here. But I'm also
not sure that multiple hash tables would really buy us much. In theory
hash tables are O(1), so multiple smaller tables doesn't help (and might
hurt, since now we have four O(1) lookups to do). Of course that's a
minor fiction because of collisions, but we keep the load factor on the
object.c table pretty low (and that's why things like quadratic probing
and cuckoo hashing never showed great improvement).

I do wonder now about the initial hash table size and any limits
on it, but that is a question for another day.

Thanks
Jeff