Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

Junio C Hamano <gitster@xxxxxxxxx> · Sat, 11 May 2013 20:00:44 -0700

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sat, May 11, 2013 at 2:49 PM, John Keeping <john@xxxxxxxxxxxxx> wrote:
>>
>> Hmm... I hadn't realised that.  Looking a bit closer, it looks like
>> init_patch_ids sets up its own diffopts so its not affected by the
>> command line (except for pathspecs which would be easy to check for).
>> Of course that still means it can be affected by settings in the user's
>> configuration.
>
> .. and in the actual diff algorithm.

As to the "objection" side of the argument, I already said
essentially the same thing several months ago:

  http://thread.gmane.org/gmane.comp.version-control.git/202654/focus=202898

and do not have much to add [*1*].

However.

The use of patch-id in cherry and rebase is to facilitate avoiding
to replay commits that are obviously identical to the ones you have
in your history.  The cached patch id for an existing old commit may
differ from a patch id you freshly compute for a new commit you are
trying to see if it truly new, even though they may represent the
same change.  So we may incorrectly think such a new commit is not
yet in your history and attempt to replay it.

But it is not a big problem.  Either 3-way merge notices that there
is nothing new, or you get a conflict and have chance to inspect
what is going on.

A conceptually much larger and more problematic issue is that we may
discard a truly new change that you still need as an old one you
already have due to a hash collision and discard it.  Because the
hash space of SHA-1 is so large, however, it is not a problem in
practice, and more importantly, that hash space is just as large as
the hash space used by Git to reduce a patch to a patch id, the
filtering done with patch-id in cherry and rebase _already_ have
that exact problem with or without this additional cache layer. A
stale cache may make the possibility of lost change due to such a
hash collision merely twice as likely.

> ... it's a "the patch ID actually ignores a lot of data in order
> to give the same ID even if lins have been added above it, and the
> patch is at different line numbers etc".

Yes.

> So maybe it doesn't matter. But at the same time, I really think
> caching patch ID's should be something people should be aware of is
> fundamentally wrong, even if it might work.

I do not think it is "caching patch ID" that people should be aware
of is fundamentally wrong.  What is fundamentally wrong, even if it
might work, is "using patch ID" itself.

> And quite frankly, if you do rebases etc so much that you think patch
> ID's are so important that they need to be cached, you may be doing
> odd/wrong things.

And that, too ;-)

[Footnote]

*1* For people listening from the sidelines, the fact that Git
algorithm can improve over time is a real issue, and and has caused
one issue that still hasn't been solved in the k.org upload process.
Somebody who has a repository there could *theoretically*:

 - push her v1.1 release via Git ("git push origin v1.1");
 - create a tarball ("git archive -o v1.1.tar v1.1") and diff since the
   last release ("git diff v1.0 v1.1 >v1.0-v1.1.diff") locally;
 - GPG sign them ("gpg -b v1.1.tar", "gpg -b v1.0-v1.1.diff"); and
 - upload only the signature files

and have k.org create the tarballs and diff to save bandwidth of
uploading logically derivable stuff over and over again.  But that
can be done only when output from "git archive" and "git diff" are
stable, which is not the case.  We changed how extended header fields
are used in the tar archive for a long pathname recently, and also
we changed use of XDF_NEED_MINIMAL a couple of years ago in "git diff";
both of these affect the output.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html