Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

John Keeping <john@xxxxxxxxxxxxx> · Sun, 12 May 2013 09:59:34 +0100

On Sat, May 11, 2013 at 08:00:44PM -0700, Junio C Hamano wrote:
> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:
> 
> > On Sat, May 11, 2013 at 2:49 PM, John Keeping <john@xxxxxxxxxxxxx> wrote:
> >>
> >> Hmm... I hadn't realised that.  Looking a bit closer, it looks like
> >> init_patch_ids sets up its own diffopts so it's not affected by the
> >> command line (except for pathspecs which would be easy to check for).
> >> Of course that still means it can be affected by settings in the user's
> >> configuration.
> >
> > .. and in the actual diff algorithm.
> 
> As to the "objection" side of the argument, I already said
> essentially the same thing several months ago:
> 
>   http://thread.gmane.org/gmane.comp.version-control.git/202654/focus=202898
> 
> and do not have much to add [*1*].
> 
> However.
> 
> The use of patch-id in cherry and rebase is to facilitate avoiding
> to replay commits that are obviously identical to the ones you have
> in your history.  The cached patch id for an existing old commit may
> differ from a patch id you freshly compute for a new commit you are
> trying to see if it is truly new, even though they may represent the
> same change.  So we may incorrectly think such a new commit is not
> yet in your history and attempt to replay it.
> 
> But it is not a big problem.  Either 3-way merge notices that there
> is nothing new, or you get a conflict and have a chance to inspect
> what is going on.

It's not a problem here, but false negatives would be annoying if you're
looking at "git log --cherry-mark".

> A conceptually much larger and more problematic issue is that we may
> mistake a truly new change that you still need for an old one you
> already have due to a hash collision and discard it.  Because the
> hash space of SHA-1 is so large, however, it is not a problem in
> practice, and more importantly, that hash space is just as large as
> the hash space used by Git to reduce a patch to a patch id, the
> filtering done with patch-id in cherry and rebase _already_ has
> that exact problem with or without this additional cache layer. A
> stale cache may make the possibility of lost change due to such a
> hash collision merely twice as likely.
> 
> > ... it's a "the patch ID actually ignores a lot of data in order
> > to give the same ID even if lines have been added above it, and the
> > patch is at different line numbers etc".
> 
> Yes.
> 
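
To make that property concrete, here is a rough sketch of the idea in
Python.  It is not what patch-ids.c or "git patch-id" actually does;
simple_patch_id and the sample diffs are made up for illustration, and
the only thing it tries to show is that dropping hunk positions, blob
ids and whitespace before hashing gives the same ID for the same change
at a different offset.

```python
import hashlib
import re

# Matches unified-diff hunk headers such as "@@ -10,3 +10,3 @@ int main(void)".
HUNK_HEADER = re.compile(r"^@@ .* @@")


def simple_patch_id(diff_text: str) -> str:
    """Hash a unified diff, ignoring hunk positions, blob ids and whitespace.

    Hypothetical helper: a simplification for illustration only, not the
    algorithm Git itself uses.
    """
    digest = hashlib.sha1()
    for line in diff_text.splitlines():
        if line.startswith("index "):
            continue  # blob ids differ between otherwise identical changes
        if HUNK_HEADER.match(line):
            continue  # drop the hunk header so the position does not matter
        # Remove all whitespace so re-indentation does not change the ID.
        digest.update("".join(line.split()).encode("utf-8"))
    return digest.hexdigest()


ORIGINAL = """\
diff --git a/hello.c b/hello.c
index 1111111..2222222 100644
--- a/hello.c
+++ b/hello.c
@@ -10,3 +10,3 @@ int main(void)
 {
-    puts("hi");
+    puts("hello");
 }
"""

# The same change after 40 unrelated lines were added above it.
SHIFTED = ORIGINAL.replace("@@ -10,3 +10,3 @@", "@@ -50,3 +50,3 @@").replace(
    "1111111..2222222", "3333333..4444444"
)

# A genuinely different change at the original position.
DIFFERENT = ORIGINAL.replace('puts("hello");', 'puts("goodbye");')

if __name__ == "__main__":
    print(simple_patch_id(ORIGINAL) == simple_patch_id(SHIFTED))
    print(simple_patch_id(ORIGINAL) == simple_patch_id(DIFFERENT))
```

The flip side is the one discussed above: anything that changes the
text of the diff itself (configuration, diff algorithm) changes the ID,
which is exactly what a cached value cannot see.
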
> > So maybe it doesn't matter. But at the same time, I really think
> > caching patch ID's should be something people should be aware of is
> > fundamentally wrong, even if it might work.
> 
> I do not think it is "caching patch ID" that people should be aware
> of is fundamentally wrong.  What is fundamentally wrong, even if it
> might work, is "using patch ID" itself.
> 
> > And quite frankly, if you do rebases etc so much that you think patch
> > ID's are so important that they need to be cached, you may be doing
> > odd/wrong things.
> 
> And that, too ;-)

I've never noticed a problem with rebases; where it matters to me is
when I use "git log --cherry master..." to see if patches I've sent to
a mailing list have been picked up.

To take Git as an example (albeit a bad one because "What's Cooking" is
a more useful way to track patch state here), if I compare this patch to
pu I have:

	$ git rev-list --left-right --count pu...
	234	1

and caching patch IDs takes that from ~0.6s to ~0.1s.  When doing that
over several branches consecutively that makes a big difference to the
overall runtime, especially because most of the commits of interest will
be cached during the first one.