On Wed, Nov 11, 2020 at 5:54 AM Derrick Stolee <stolee@xxxxxxxxx> wrote: > > On 11/2/2020 3:43 PM, Elijah Newren wrote: > > I have some ideas for using a histogram diff to improve content merges, > > which fundamentally relies on the idea of a histogram. Since the diffs > > are never displayed to the user but just used internally for merging, > > the typical user preference shouldn't matter anyway, and I want to make > > sure that all my testing works with this algorithm. > > > > Granted, I don't yet know if those ideas will pan out and I haven't even > > tried any of them out yet, but it's easy to change the diff algorithm in > > the future if needed or wanted. For now, just set it to histogram. > > If you are not making use of the histogram yet, then could you set this > patch aside until you _do_ use it? Or are there performance implications > that are also a side benefit? Long story... git folks tend to value performance pretty strongly -- including sometimes valuing it OVER correctness. For example, if fast-export completely munges some merge commits and you send your first ever patch in to the list to fix it (by turning on topo_order), you might run into folks asking if it should be made an option so that we don't slow down exports except for people who happen to know they need it (and thus risk breaking exports for people who happen to be unaware that they need it...). Luckily, Peff bailed me out in that situation by doing some timings and finding that topo_order actually made fast-export _faster_, to everyone's surprise at the time. Or, to take another example, perhaps someone will introduce some commit-date cutoff on the revision walking machinery that sometimes breaks the answer but makes things faster...and then causes a bunch of headaches years down the road when someone tries to introduce commit-graphs to get us always correct _and_ fast answers. In this case, histogram diffs in my cursory investigation are about 2% slower than Myers diffs. I think others may have done even more detailed benchmarks. They've been around for years, but haven't been made the default, despite giving obviously better looking diffs to users in a number of cases where Myers diffs are unintelligible. But, far more importantly, there are real merge bugs we know about that are even affecting git.git and linux.git that I don't have a clue how to address without the additional information that I believe is provided by histogram diffs. See the following: https://lore.kernel.org/git/20190816184051.GB13894@xxxxxxxxxxxxxxxxxxxxx/ https://lore.kernel.org/git/CABPp-BHvJHpSJT7sdFwfNcPn_sOXwJi3=o14qjZS3M8Rzcxe2A@xxxxxxxxxxxxxx/ https://lore.kernel.org/git/CABPp-BGtez4qjbtFT1hQoREfcJPmk9MzjhY5eEq1QhXT23tFOw@xxxxxxxxxxxxxx/ I don't like mismerges. I really don't like silent mismerges. While I am sometimes willing to make performance and correctness tradeoff, I'm much more interested in correctness in general. I want to fix the above bugs. I have not yet started doing so, but I believe histogram diff at least gives me an angle. But I can't rely on using the information from histogram diff unless it's in use. And it hasn't been used because of a few percentage performance hit. But, since I happen to be speeding up typical non-trivial merges/rebases/cherry-picks by factors of 10 or more (at least, anywhere that read-and-update-the-index is a trivial percentage of overall time instead of 99.9% of overall time like it is for you), now is golden opportunity to switch out the underlying diff algorithm so that I can get the data I need to fix the bugs I know are there. Whether I can actually fix them is yet to be seen; I won't even start until merge-ort is complete and merged. Does that help?