Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[your original probably didn't make it to the list because of its 5MB
 attachment; the list has a 100K limit; I'll try to quote liberally]

On Tue, Apr 19, 2016 at 04:17:50PM -0700, Jacob Keller wrote:

> I ran this version of the patch against the entire Linux kernel
> history, as I figured this has a large batch of C code to try and spot
> any issues.
> 
> I ran something like the following command in bash
> 
> $git rev-list HEAD | while read -r rev; do diff -F ^commit -u <(git
> show --format="commit %H" --no-compaction-heuristic $rev) <(git show
> --format="commit %H" --compaction-heuristic $rev); done >
> heuristic.patch

My earlier tests with the perl script were all done with "git log -p",
which will not show anything at all for merges (and my script wouldn't
know how to deal with combined diffs anyway). But I think this new patch
_will_ kick in for combined diffs (because it is built on individual
diffs). It will be interesting to see if this has any effect there, and
what it looks like.

We should be able to see it (on a small enough repository) with:

  git log --format='commit %H' --cc --merges

and comparing the before/after.

> I've attached the file that I generated for the Linux history, it's
> rather large so hopefully I can get some help to spot any differences.
> The above approach will work for pretty much any repository, and works
> better than trying to generate the entire thing first and then diff
> (since that runs out of memory pretty fast).

I don't think there is much point in generating a complete diff between
the patches for every commit, when nobody can look at the whole thing.
Unless we have automated tooling to find "interesting" bits (and
certainly a tool to remove the boring "a comment got shifted by one"
lines would help; those are all known improvements, but it's the _other_
stuff we want to look).

But if we are not using automated tooling to find the needle in the
haystack, we might as well using sampling to make the dataset more
manageable. Adding "--since=1.year.ago" is one way, though we may want
to sample more randomly across time.

> So far, I haven't spotted anything that would want me to disable it,
> while I've spotted several cases where I felt that readability was
> improved. It's somewhat difficult to spot though.

I did find one case that I think is worse. Look at 857942fd1a in the
kernel. It has a pattern like this:

  ... surrounding code ...

  function_one();
  ... more surrounding code ...

which becomes:

  ... surrounding code ...

  function_two();

  ... more surrounding code

Without the new heuristic, that looks like:

  -function_one();
  +function_two();
  +

but with it, it becomes:

  +
  +function_two();

  -function_one();

which is kind of weird. Having the two directly next to each other reads
better to me. This is a pretty unusual diff, though, in that it did
change the surrounding whitespace (and if you look further in the diff,
the identical change is made elsewhere _without_ touching the
whitespace). So this is kind of an anomaly. And IMHO the weirdness here
is outweighed by the vast number of improvements elsewhere.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]