Re: [PATCH 0/2] Make diff plumbing commands respect the indentHeuristic.

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 1 May 2017 12:34:38 +0200

On Sat, Apr 29, 2017 at 12:04 AM, Jeff King <peff@xxxxxxxx> wrote:
> On Fri, Apr 28, 2017 at 10:34:15AM -0700, Stefan Beller wrote:
>
>> > So instead I chose to make the indentHeuristic option part of diff's basic
>> > configuration, and in each of the diff plumbing commands I moved the call to
>> > git_config() before the call to init_revisions().
>> [...]
>>
>> The feature was included in v2.11 (released 2016-11-29) and we got no
>> negative feedback. Quite the opposite, all feedback we got, was positive.
>> This could be explained by having the feature as an experimental feature
>> and users who would be broken by it, did not test it yet or did not speak up.
>
> Yeah, if the point you're trying to make is "nobody will be mad if this
> is turned on by default", I don't think shipping it as a config option
> is very conclusive.
>
> I think a more interesting argument is: we turned on renames by default
> a few versions ago, which changes the diff in a much bigger way, and
> nobody complained.
>
>   As a side note, I do happen to know of one program that depends
>   heavily on diffs remaining stable. Imagine you have a Git hosting site
>   which lets people comment on lines of diffs. You need some way to
>   address the lines of the diff so that the annotations appear in the
>   correct position when you regenerate the diff later.
>
>   One way to do it is to just position the comment at the n'th line of
>   the diff. Which obviously breaks if the diff changes. IMHO that is a
>   bug in that program, which should be fixed to use the line numbers
>   from the original blob (which is still not foolproof, because a
>   different diff algorithm may move a change such that the line isn't
>   even part of the diff anymore).
>
>   I'm not worried about this particular program, as I happen to know it
>   has already been fixed. But it's possible others have made a similar
>   mistake.
>
>> So I'd propose to turn it on by default and anyone negatively impacted by that
>> could then use the config to turn it off for themselves (including plumbing).
>>
>> Something like this, maybe?
>
> Yeah, as long as this is on top of Marc's patches, I think it is the
> natural conclusion that we had planned.
>
> I don't know if we would want to be extra paranoid about patch-ids.
> There is no helping:
>
>   git rev-list HEAD | git diff-tree --stdin -p | git patch-id --stable
>
> because diff-tree doesn't know that it's trying for "--stable" output.
> But the diffs we compute internally for patch-id could disable the
> heuristics. I'm not sure if those matter, though. AFAIK those are used
> only for internal comparisons within a single program. I.e., we never
> compare them against input from the user, nor do we output them to the
> user. So they'll change, but I don't think anybody would care.

I have a few-million row table with commit_id as one column & patch_id
as another. I.e. a commit -> patch_id mapping.

This is used for an hourly reporting system of sorts, i.e. you can see
that you were working on commit X on Friday. It uses the patch-id to
de-duplicate that commit against some commit Y you may have pushed to
a topic branch, and already used for filling in hours.

I.e. it's a thing do DWYM in a rebase-heavy workflow where the commit
ids change, and where you're too lazy to have deleted the topic branch
afterwards (because they get pruned anyway).

It's fine for me if the patch-id changes, for most diffs it'll be the
same, and if not it'll auto-adjust eventually since the records I'm
inserting will all use the new algorithm.

But as the diff algorithm changes this system will start presenting
both X and Y in the UI as not-duplicate, because it'll have computed
the patch id of X with an older version of git, and the id of Y with
the new one.

I'm surely not the only one who's written something like this. Just
wanted to point out that systems like this do exist, I don't know of
any easier way with git to get a "is this rebased commit the same"
than patch-id, and for performance reasons you may be comparing patch
ids across git versions.