On 06/30/2016 03:54 PM, Michael Haggerty wrote: > On 06/23/2016 07:10 PM, Michael Haggerty wrote: >> On 06/17/2016 06:09 PM, Stefan Beller wrote: >>> I think before spending more time on discussing and implementing new >>> (hopefully better) heuristics, I'd want to step back and try to be a bit more >>> systematic, i.e. I'll want to collect lots of test cases in different languages >>> and use cases. A mini test suite, maybe? >> >> Stefan, >> >> I've also been playing around with diff heuristics and also trying to >> find some good test data. Have you put together some test cases yet? > > I put quite of work into tooling to evaluate diff heuristics, and just > published it on GitHub: > > https://github.com/mhagger/diff-slider-tools Today I hand-optimized about 2700 diff sliders in 12 different repositories and used that as a reference corpus against which I tested the accuracy of four different algorithms. The tools and the hand-optimized values are now in the GitHub repository mentioned above. See also this pull request [1]. If I didn't make any mistakes, the number of errors (i.e., cases where the algorithm choose a different slider shift than the one I picked by hand) look like this: > | repository | default | compaction | indent | indent-favor-edges | > | -------------------- | ------- | ---------- | ------ | ------------------ | > | ant | 225 | 102 | 7 | 7 | > | bugzilla | 208 | 81 | 14 | 14 | > | couchdb | 44 | 24 | 13 | 10 | > | docker | 180 | 160 | 29 | 18 | > | git | 446 | 59 | 27 | 19 | > | ipython | 358 | 163 | 61 | 11 | > | junit | 146 | 67 | 5 | 5 | > | nodejs | 489 | 78 | 12 | 12 | > | phpmyadmin | 330 | 49 | 1 | 0 | > | test-more | 15 | 2 | 2 | 0 | > | test-unit | 33 | 13 | 4 | 4 | > | -------------------- | ------- | ---------- | ------ | ------------------ | > | totals | 2474 | 798 | 175 | 100 | The algorithms are: * default -- the default behavior `git diff` on the current master * compaction -- `git diff --compaction-heuristic` * indent -- the indent-based algorithm as reported earlier in this thread * indent-favor-edges -- the indent-based algorithm, with some improved heuristics regarding sliders near the begin/end of file and probably one or two fixes I encourage other people to run the tests against some code samples in their favorite programming languages so that the testing can cover a wider range of inputs, and submit your data to the GitHub project. Also feel free to tweak the heuristic (there's probably still room for improvement). All the tools and raw data for my experiments are already there. I'll be going on vacation for the next two weeks so I probably can't follow up on this for a while. But I think we should build an algorithm like this into Git! Michael [1] https://github.com/mhagger/diff-slider-tools/pull/1 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html