Hi John,

On Mon, Feb 6, 2023 at 12:47 PM John Cai <johncai86@xxxxxxxxx> wrote:
> [...]
> That being said, here's a separate issue. I benchmarked the usage of
> .gitattributes as introduced in this patch series, and indeed it does look like
> there is additional latency:
>
> $ echo "* diff-algorithm=patience" >> .gitattributes
> $ hyperfine -r 5 'git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0'
> Benchmark 1: git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0
>   Time (mean ± σ):     889.4 ms ± 113.8 ms    [User: 715.7 ms, System: 65.3 ms]
>   Range (min … max):   764.1 ms … 1029.3 ms    5 runs
>
> $ hyperfine -r 5 'git-bin-wrapper diff v2.0.0 v2.28.0'
> Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0
>   Time (mean ± σ):     2.146 s ±  0.368 s    [User: 0.827 s, System: 0.243 s]
>   Range (min … max):   1.883 s …  2.795 s    5 runs
>
> and I imagine the latency scales with the size of .gitattributes. Although I'm
> not familiar with other parts of the codebase and how it deals with the latency
> introduced by reading attributes files.

Yeah, that seems like a large relative performance penalty. I had the
feeling that histogram wasn't made the default over myers mostly due
to inertia and due to a potential 2% loss in performance (since
potentially corrected by Phillip's 663c5ad035 ("diff histogram: intern
strings", 2021-11-17)). If we had changed the default diff algorithm
to histogram, I suspect folks wouldn't have been asking for per-file
knobs to use a better diff algorithm. And the performance penalty
for this alternative is clearly much larger than 2%, which makes me
think we might want to just revisit the default instead of allowing
per-file tweaks.

And on a separate note...

There's another set of considerations we might need to include here as
well that I haven't seen anyone else in this thread talk about:

* When trying to diff files, do we read the .gitattributes file from
  the current checkout to determine the diff algorithm(s)? Or the
  index? Or the commit we are diffing against?
* If we use the current checkout or index, what about bare clones or
  diffing between two different commits?
* If diffing between two different commits, and the .gitattributes
  has changed between those commits, which .gitattributes file wins?
* If diffing between two different commits, and the .gitattributes
  has NOT changed, BUT a file has been renamed and the old and new
  names have different rules, which rule wins?
* If per-file diff algorithms are adopted widely enough, will we be
  forced to change the merge algorithm to also pay attention to them?
  If so, more complicated rename cases occur and we need rules for
  how to handle those.
* If the merge algorithm has to pay attention to .gitattributes for
  this too, we'll have even more corner cases around what happens if
  there are merge conflicts in .gitattributes itself (which is
  already kind of ugly and kludged).

Anyway, I know I'm a bit animated and biased in this area, and I
apologize if I'm a bit too much so. Even if I am, hopefully my
comments at least provide some useful context.