Re: [PATCH v2 04/20] merge-ort: use histogram diff

Elijah Newren <newren@xxxxxxxxx> · Wed, 11 Nov 2020 08:47:20 -0800

On Wed, Nov 11, 2020 at 5:54 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > I have some ideas for using a histogram diff to improve content merges,
> > which fundamentally relies on the idea of a histogram.  Since the diffs
> > are never displayed to the user but just used internally for merging,
> > the typical user preference shouldn't matter anyway, and I want to make
> > sure that all my testing works with this algorithm.
> >
> > Granted, I don't yet know if those ideas will pan out and I haven't even
> > tried any of them out yet, but it's easy to change the diff algorithm in
> > the future if needed or wanted.  For now, just set it to histogram.
>
> If you are not making use of the histogram yet, then could you set this
> patch aside until you _do_ use it? Or are there performance implications
> that are also a side benefit?

Long story...

git folks tend to value performance pretty strongly -- including
sometimes valuing it OVER correctness.  For example, if fast-export
completely munges some merge commits and you send your first ever
patch in to the list to fix it (by turning on topo_order), you might
run into folks asking if it should be made an option so that we don't
slow down exports except for people who happen to know they need it
(and thus risk breaking exports for people who happen to be unaware
that they need it...).  Luckily, Peff bailed me out in that situation
by doing some timings and finding that topo_order actually made
fast-export _faster_, to everyone's surprise at the time.  Or, to take
another example, perhaps someone will introduce some commit-date
cutoff on the revision walking machinery that sometimes breaks the
answer but makes things faster...and then causes a bunch of headaches
years down the road when someone tries to introduce commit-graphs to
get us always correct _and_ fast answers.

In this case, histogram diffs in my cursory investigation are about 2%
slower than Myers diffs.  I think others may have done even more
detailed benchmarks.  They've been around for years, but haven't been
made the default, despite giving obviously better looking diffs to
users in a number of cases where Myers diffs are unintelligible.  But,
far more importantly, there are real merge bugs we know about that are
even affecting git.git and linux.git that I don't have a clue how to
address without the additional information that I believe is provided
by histogram diffs.  See the following:

https://lore.kernel.org/git/20190816184051.GB13894@xxxxxxxxxxxxxxxxxxxxx/
https://lore.kernel.org/git/CABPp-BHvJHpSJT7sdFwfNcPn_sOXwJi3=o14qjZS3M8Rzcxe2A@xxxxxxxxxxxxxx/
https://lore.kernel.org/git/CABPp-BGtez4qjbtFT1hQoREfcJPmk9MzjhY5eEq1QhXT23tFOw@xxxxxxxxxxxxxx/

I don't like mismerges.  I really don't like silent mismerges.  While
I am sometimes willing to make performance and correctness tradeoff,
I'm much more interested in correctness in general.  I want to fix the
above bugs.  I have not yet started doing so, but I believe histogram
diff at least gives me an angle.  But I can't rely on using the
information from histogram diff unless it's in use.  And it hasn't
been used because of a few percentage performance hit.

But, since I happen to be speeding up typical non-trivial
merges/rebases/cherry-picks by factors of 10 or more (at least,
anywhere that read-and-update-the-index is a trivial percentage of
overall time instead of 99.9% of overall time like it is for you), now
is golden opportunity to switch out the underlying diff algorithm so
that I can get the data I need to fix the bugs I know are there.
Whether I can actually fix them is yet to be seen; I won't even start
until merge-ort is complete and merged.

Does that help?