Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Elijah

On 10/02/2023 09:57, Elijah Newren wrote:
Hi Phillip,

On Thu, Feb 9, 2023 at 6:44 AM Phillip Wood <phillip.wood123@xxxxxxxxx> wrote:

Hi Elijah

On 09/02/2023 09:09, Elijah Newren wrote:
I'd support a change to either patience or histogram as the default
algorithm. My personal preference would be for the patience algorithm as
I think it generally gives nicer diffs in the cases that the two
disagree (see below, I've tried changing diff.algorithm to histogram a
few times and I always end up changing it back to patience pretty
quickly). However I can see there is an advantage in having "diff" and
"merge" use the same algorithm as users who diffing either side to the
merge base will see the same diff that the merge is using. The histogram
algorithm is known to produce sub-optimal diffs in certain cases[1] but
I'm not sure how much worse it is in that respect than any of the other
algorithms.
[...]
[1]
https://lore.kernel.org/git/CAGZ79kZYO6hHiAM8Sfp3J=VX11c=0-7YDSx3_EAKt5-uvvt-Ew@xxxxxxxxxxxxxx/

Thanks, I might have a fix, though I'm a bit worried my tweaks might
trigger issues elsewhere or cost a bit of performance; I'll need to
test.  Are there any other good known testcases where histogram
produces sub-optimal diffs?

Not that I'm aware of (I've a feeling there might have been something on the JGit mailing list but I only managed to find a copy of Stefan's message). Loosely related is [1] which talks about hash collisions and I've never found the time to look at properly. I suspect any hash collision problem is more likely to affect xdl_classify_record() which is used by all the algorithms.

To see the differences between the output of patience and histogram
algorithms I diffed the output of "git log -p --no-merges
--diff-algorithm=patience" and "git log -p --no-merges
--diff-algorithm=histogram". The first three differences are

- 6c065f72b8 (http: support CURLOPT_PROTOCOLS_STR, 2023-01-16)
    In get_curl_allowed_protocols() the patience algorithm shows the
    change in the return statement more clearly

- 47cfc9bd7d (attr: add flag `--source` to work with tree-ish, 2023-01-14)
     The histogram algorithm shows read_attr_from_index() being moved
     whereas the patience algorithm does not making the diff easier to
     follow.

- b0226007f0 (fsmonitor: eliminate call to deprecated FSEventStream
function, 2022-12-14)
    In fsm_listen__stop_async() the histogram algorithm shows
    data->shutdown_style = SHUTDOWN_EVENT;
    being moved, which is not as clear as the patience output which
    shows it as a context line.

If my current changes are "good", then they also remove the
differences between patience and histogram for the second and third
commits above.  (And the differences between the two algorithms for
the first commit look really minor.)

Interesting. I agree the differences for the first commit are small. Interestingly I think they come from patience algorithm falling back to the myers implementation because it cannot find any unique context lines (I have a patch that removes the fallback[2] and it gives the same result as the histogram implementation).

Best Wishes

Phillip

[1] https://lore.kernel.org/git/4e0eff48-4a3e-4f0e-9ed2-d01ec38442a5@xxxxxxxxxxxxxxxx/

[2] https://github.com/phillipwood/git/commits/pure-patience-diff



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux