Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Apr 19, 2016 at 9:18 PM, Jeff King <peff@xxxxxxxx> wrote:
> [your original probably didn't make it to the list because of its 5MB
>  attachment; the list has a 100K limit; I'll try to quote liberally]
>
> On Tue, Apr 19, 2016 at 04:17:50PM -0700, Jacob Keller wrote:
>
>> I ran this version of the patch against the entire Linux kernel
>> history, as I figured this has a large batch of C code to try and spot
>> any issues.
>>
>> I ran something like the following command in bash
>>
>> $git rev-list HEAD | while read -r rev; do diff -F ^commit -u <(git
>> show --format="commit %H" --no-compaction-heuristic $rev) <(git show
>> --format="commit %H" --compaction-heuristic $rev); done >
>> heuristic.patch
>
> My earlier tests with the perl script were all done with "git log -p",
> which will not show anything at all for merges (and my script wouldn't
> know how to deal with combined diffs anyway). But I think this new patch
> _will_ kick in for combined diffs (because it is built on individual
> diffs). It will be interesting to see if this has any effect there, and
> what it looks like.
>
> We should be able to see it (on a small enough repository) with:
>
>   git log --format='commit %H' --cc --merges
>
> and comparing the before/after.
>
>> I've attached the file that I generated for the Linux history, it's
>> rather large so hopefully I can get some help to spot any differences.
>> The above approach will work for pretty much any repository, and works
>> better than trying to generate the entire thing first and then diff
>> (since that runs out of memory pretty fast).
>
> I don't think there is much point in generating a complete diff between
> the patches for every commit, when nobody can look at the whole thing.
> Unless we have automated tooling to find "interesting" bits (and
> certainly a tool to remove the boring "a comment got shifted by one"
> lines would help; those are all known improvements, but it's the _other_
> stuff we want to look).
>
> But if we are not using automated tooling to find the needle in the
> haystack, we might as well using sampling to make the dataset more
> manageable. Adding "--since=1.year.ago" is one way, though we may want
> to sample more randomly across time.
>
>> So far, I haven't spotted anything that would want me to disable it,
>> while I've spotted several cases where I felt that readability was
>> improved. It's somewhat difficult to spot though.
>
> I did find one case that I think is worse. Look at 857942fd1a in the
> kernel. It has a pattern like this:
>
>   ... surrounding code ...
>
>   function_one();
>   ... more surrounding code ...
>
> which becomes:
>
>   ... surrounding code ...
>
>   function_two();
>
>   ... more surrounding code
>
> Without the new heuristic, that looks like:
>
>   -function_one();
>   +function_two();
>   +
>
> but with it, it becomes:
>
>   +
>   +function_two();
>
>   -function_one();
>
> which is kind of weird. Having the two directly next to each other reads
> better to me. This is a pretty unusual diff, though, in that it did
> change the surrounding whitespace (and if you look further in the diff,
> the identical change is made elsewhere _without_ touching the
> whitespace). So this is kind of an anomaly. And IMHO the weirdness here
> is outweighed by the vast number of improvements elsewhere.

The new implementation supports the flags for ignoring white space, too.

>
> -Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]