Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

Jeff King <peff@xxxxxxxx> · Wed, 29 May 2013 02:20:07 -0400

On Sun, May 19, 2013 at 02:17:35PM +0100, John Keeping wrote:

> When using "git cherry" or "git log --cherry-pick" we often have a small
> number of commits on one side and a large number on the other.  In
> revision.c::cherry_pick_list we store the patch IDs for the small side
> before comparing the large side to this.
> 
> In this case, it is likely that only a small number of paths are touched
> by the commits on the smaller side of the range and by storing these we
> can ignore many commits on the other side before unpacking blobs and
> diffing them.

There are two observations that make your scheme work:

  1. For something like --cherry-pick, we do not even care about the
     actual patch-ids, only whether there are matches from the left and
     right sides.

  2. Comparing the sets of paths touched by two commits is often
     sufficient to realize they do not have the same patch-ids.

But I think those observations mean we can do even better than
calculating the real patch-id for the small side at all.

Imagine we have both "loose" and "strict" patch-ids, where the loose one
is much faster to compute, and has that property that a match _might_
mean we have the same strict patch-id, but a non-match means we
definitely do not have the same strict patch-id.

I think such a loose patch-id could just be a hash of the filenames that
were changed by the patch (e.g., the first 32-bits of the sha1 of the
concatenated filenames). Computing that should be about as expensive as
a tree-diff. Per observation 2 above, if two commits do not have the
same loose id, we know that they cannot possibly have the same strict
id.

Then we can forget about the smaller-side and bigger-side entirely, and
just do something like:

  1. Make a sorted list (or hash table) of loose ids for one side.

  2. For each commit on the other side, calculate its loose id and look
     that up in the sorted list. If no hits, we know that there is no
     match. For any hits, lazily calculate (and cache) the strict patch
     id for both sides and compare as usual.

In the best case, we compute no patch-ids at all. And even for the
average case, I'd expect our lazy calculation to only have to compute a
handful of ids.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html