diff heuristics dramatically improved by considering line indentation and blank lines

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 06/30/2016 03:54 PM, Michael Haggerty wrote:
> On 06/23/2016 07:10 PM, Michael Haggerty wrote:
>> On 06/17/2016 06:09 PM, Stefan Beller wrote:
>>> I think before spending more time on discussing and implementing new
>>> (hopefully better) heuristics, I'd want to step back and try to be a bit more
>>> systematic, i.e. I'll want to collect lots of test cases in different languages
>>> and use cases. A mini test suite, maybe?
>>
>> Stefan,
>>
>> I've also been playing around with diff heuristics and also trying to
>> find some good test data. Have you put together some test cases yet?
> 
> I put quite of work into tooling to evaluate diff heuristics, and just
> published it on GitHub:
> 
>     https://github.com/mhagger/diff-slider-tools

Today I hand-optimized about 2700 diff sliders in 12 different
repositories and used that as a reference corpus against which I tested
the accuracy of four different algorithms. The tools and the
hand-optimized values are now in the GitHub repository mentioned above.
See also this pull request [1].

If I didn't make any mistakes, the number of errors (i.e., cases where
the algorithm choose a different slider shift than the one I picked by
hand) look like this:

> | repository           | default | compaction | indent | indent-favor-edges |
> | -------------------- | ------- | ---------- | ------ | ------------------ |
> | ant                  |     225 |        102 |      7 |                  7 |
> | bugzilla             |     208 |         81 |     14 |                 14 |
> | couchdb              |      44 |         24 |     13 |                 10 |
> | docker               |     180 |        160 |     29 |                 18 |
> | git                  |     446 |         59 |     27 |                 19 |
> | ipython              |     358 |        163 |     61 |                 11 |
> | junit                |     146 |         67 |      5 |                  5 |
> | nodejs               |     489 |         78 |     12 |                 12 |
> | phpmyadmin           |     330 |         49 |      1 |                  0 |
> | test-more            |      15 |          2 |      2 |                  0 |
> | test-unit            |      33 |         13 |      4 |                  4 |
> | -------------------- | ------- | ---------- | ------ | ------------------ |
> | totals               |    2474 |        798 |    175 |                100 |

The algorithms are:

* default -- the default behavior `git diff` on the current master
* compaction -- `git diff --compaction-heuristic`
* indent -- the indent-based algorithm as reported earlier in
  this thread
* indent-favor-edges -- the indent-based algorithm, with some
  improved heuristics regarding sliders near the begin/end of file
  and probably one or two fixes

I encourage other people to run the tests against some code samples in
their favorite programming languages so that the testing can cover a
wider range of inputs, and submit your data to the GitHub project. Also
feel free to tweak the heuristic (there's probably still room for
improvement). All the tools and raw data for my experiments are already
there.

I'll be going on vacation for the next two weeks so I probably can't
follow up on this for a while. But I think we should build an algorithm
like this into Git!

Michael

[1] https://github.com/mhagger/diff-slider-tools/pull/1

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]