Re: [PATCH v4] unpack-trees: avoid duplicate ODB lookups during checkout

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> · Fri, 14 Apr 2017 17:06:05 -0400

On 4/14/2017 3:52 PM, Jeff King wrote:
On Fri, Apr 14, 2017 at 07:25:54PM +0000, git@xxxxxxxxxxxxxxxxx wrote:

 	for (i = 0; i < n; i++, dirmask >>= 1) {
-		const unsigned char *sha1 = NULL;
-		if (dirmask & 1)
-			sha1 = names[i].oid->hash;
-		buf[i] = fill_tree_descriptor(t+i, sha1);
+		if (i > 0 && are_same_oid(&names[i], &names[i - 1]))
+			t[i] = t[i - 1];
+		else if (i > 1 && are_same_oid(&names[i], &names[i - 2]))
+			t[i] = t[i - 2];
+		else {
+			const unsigned char *sha1 = NULL;
+			if (dirmask & 1)
+				sha1 = names[i].oid->hash;
+			buf[nr_buf++] = fill_tree_descriptor(t+i, sha1);
+		}

This looks fine to me.

Just musing (and I do not think we need to go further than your patch),
we're slowly walking towards an actual object-content cache. The "buf"
array is now essentially a cache of all oids we've loaded, but it
doesn't know its sha1s. So we could actually encapsulate all of the
caching:

  struct object_cache {
	  int nr_entries;
	  struct object_cache_entry {
		  struct object_id oid;
		  void *data;
	  } cache[MAX_UNPACK_TREES];
  };

and then ask it "have you seen oid X" rather than playing games with
looking at "i - 1". Of course it would have to do a linear search, so
the next step is to replace its array with a hashmap.

And now suddenly we have a reusable object-content cache that you could
use like:

  struct object_cache = {0};
  for (...) {
    /* maybe reads fresh, or maybe gets it from the cache */
    void *data = read_object_data_cached(&oid, &cache);
  }
  /* operation done, release the cache */
  clear_object_cache(&cache);

which would work anywhere you expect to load N objects and see some
overlap.

Of course it would be nicer still if this all just happened
automatically behind the scenes of read_object_data(). But it would have
to keep an _extra_ copy of each object, since the caller expects to be
able to free it. We'd probably have to return instead a struct with
buffer/size in it along with a reference counter.

I don't think any of that is worth it unless there are spots where we
really expect there to be a lot of cases where we hit the same objects
in rapid succession. I don't think there should be, though. Our usual
"caching" mechanism is to create a "struct object", which is enough to
perform most operations (and has a much smaller memory footprint).

So again, just musing. I think your patch is fine as-is.

Thanks for your help on this one.

I think before I tried to do a cache at this layer,
I would like to look at (or have a brave volunteer
look at) the recursive tree traversal.  In my Windows
tree I have 500K directories, so the full recursive
tree traversal touches them.  My change cuts the ODB
lookups (on a "checkout -b") from 1M to 500K (roughly),
but I still have to do 500K strcmp's to get those
savings.  What would be nice would be to have maybe
an alternate callback -- one which knows the peers
(and everything under them) are equal and let it short
cut as much as it can.  The alternate version of the
above routine would be able to avoid the strcmp's,
but I'm guessing that there would also be savings
when we look within a treenode -- the oidcmp's and
some of the n-way parallel sub-treenode-iteration.
I'm just swag'ing here, but there might be something
here.

Jeff