Re: Histogram/patience diff matching lines with different counts

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Fri, 10 Jan 2025 11:33:53 -0800

Martin von Zweigbergk <martinvonz@xxxxxxxxx> writes:
> After that, "c" is unique on the left side but has a different count
> (namely 2) on the right side, so I would have thought that it should
> not be considered matching. Does anyone know if it's implemented this
> way on purpose? 

The purpose, if any, might be lost to history. The implementation of
histogram diff in Git seems to be a port from JGit (8c912eea94 (teach
--histogram to diff, 2011-07-12)). The one in JGit [1] seems to be an
original extension of patience diff by Shawn Pearce, who has passed away
a few years ago.

The class documentation comment in [1] does not give any rationale for
or against rejecting lines with different counts, but quoting from it:

> * By always selecting a LCS position with the lowest occurrence count, this
> * algorithm behaves exactly like Bram Cohen's patience diff whenever there is a
> * unique common element available between the two sequences. When no unique
> * elements exist, the lowest occurrence element is chosen instead. This offers
> * more readable diffs than simply falling back on the standard Myers' O(ND)
> * algorithm would produce.

I think it makes sense to reject lines with different counts, just like
how jj does it today, since the original motivation for low-occurring
lines in both the patience diff and the histogram diff algorithms was
to keep high-signal lines (i.e. not lines such as "}" and "return;") as
context (instead of + or -), and if the count of a line differs, it is
probably not a high-signal line in the first place.

I don't think it's worth changing it now, though, especially in Git. In
the scenario you describe, even if we change Git to reject lines with
different counts, failing to find a matching line means we fall back to
Myers, which matches up the only "c" on the left and the first "c" on
the right anyway (so in the end, we still won't get the result that you
might want - reporting that the whole block has changed). (In jj's case,
in which there is no fall back to Myers, I think it's reasonable to make
histogram diff work only with non-different counts, since the lack of
a fall back will indeed mean that we report that the whole block has
changed. This sounds like the "highly ambiguous" case that Bram Cohen,
the inventor of patience diff, mentions in [3].)

To further complicate things, in both JGit and Git, a line with
different counts is not considered matching only if the count in
"A" (the left hand side) is greater than the count in "B" [2]. I can't
think of a reason for this asymmetry, and the class documentation
comment in [1] doesn't explain that either.

[1] https://eclipse.googlesource.com/jgit/jgit/+/refs/heads/master/org.eclipse.jgit/src/org/eclipse/jgit/diff/HistogramDiff.java
[2] https://eclipse.googlesource.com/jgit/jgit/+/refs/heads/master/org.eclipse.jgit/src/org/eclipse/jgit/diff/HistogramDiffIndex.java#206
[3] https://lore.kernel.org/git/alpine.DEB.1.00.0902052113590.7491@intel-tinevez-2-302/

> As some of you know, I work on the Jujutsu/jj VCS
> (https://github.com/jj-vcs/jj). We also use histogram diff (and only
> histogram diff) and actually allowed matching up lines with different
> counts a while ago, but I thought it seemed too arbitrary to line up
> the first matches if there were different counts, so we changed that.
> Then we got a report from a user that Git behaves differently. See
> https://github.com/jj-vcs/jj/issues/761#issuecomment-2581219294 for
> more details.
> 
> Thanks

I think that it is unavoidable that different VCSes may produce
different diffs. Even in Git itself, there are many options (including
which algorithm to use) that can change the nature of the diff produced.