Hi John
On 06/02/2023 20:37, John Cai wrote:
On 6 Feb 2023, at 11:39, Ævar Arnfjörð Bjarmason wrote:
On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
For some non-nitpicking, I do worry about exposing this as a DoS vector,
e.g. here's a diff between two distant points in git.git with the
various algorithms:
$ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
Time (abs ≡): 42.121 s [User: 41.879 s, System: 0.144 s]
Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
Time (abs ≡): 35.634 s [User: 35.473 s, System: 0.160 s]
Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
Time (abs ≡): 46.912 s [User: 46.657 s, System: 0.228 s]
Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
Time (abs ≡): 33.233 s [User: 33.072 s, System: 0.160 s]
Summary
'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'
Thanks for this analysis. To clarify, .gitconfig's diff.algorithm setting is
already an attack vector right?
.gitconfig is under the user's control though whereas .gitattributes is
attacker controlled if one clones a malicious repository. Having said
the worst results above are for the historgram algorithm that merge-ort
uses internally and no one has complained about ort's performance.
I see how this would be adding another one.
That being said, here's a separate issue. I benchmarked the usage of
.gitattributes as introduced in this patch series, and indeed it does look like
there is additional latency:
$ echo "* diff-algorithm=patience >> .gitattributes
$ hyperfine -r 5 'git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0' ✭
Benchmark 1: git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0
Time (mean ± σ): 889.4 ms ± 113.8 ms [User: 715.7 ms, System: 65.3 ms]
Range (min … max): 764.1 ms … 1029.3 ms 5 runs
$ hyperfine -r 5 'git-bin-wrapper diff v2.0.0 v2.28.0' ✭
Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0
Time (mean ± σ): 2.146 s ± 0.368 s [User: 0.827 s, System: 0.243 s]
Range (min … max): 1.883 s … 2.795 s 5 runs
and I imagine the latency scales with the size of .gitattributes. Although I'm
not familiar with other parts of the codebase and how it deals with the latency
introduced by reading attributes files.
Ouch! Thanks for benchmarking that is a suspiciously large slow down.
I've a feeling the attributes code parses all the .gitattribute files
from scratch for each path that's queried, so there may be scope for
making it a bit smarter. I see some slow down but no where near as much
$ hyperfine -r 3 'bin-wrappers/git diff --diff-algorithm=patience
--no-color --no-color-moved v2.0.0 v2.28.0'
Benchmark 1: bin-wrappers/git diff --diff-algorithm=patience --no-color
--no-color-moved v2.0.0 v2.28.0
Time (mean ± σ): 1.996 s ± 0.008 s [User: 1.706 s, System:
0.286 s]
Range (min … max): 1.989 s … 2.004 s 3 runs
$ hyperfine -r 5 'bin-wrappers/git diff --no-color --no-color-moved
v2.0.0 v2.28.0'
Benchmark 1: bin-wrappers/git diff --no-color --no-color-moved v2.0.0
v2.28.0
Time (mean ± σ): 2.238 s ± 0.037 s [User: 1.880 s, System:
0.350 s]
Range (min … max): 2.216 s … 2.303 s 5 runs
Perhaps I'm over simplifying but having read the issue you linked to I
couldn't help feeling that the majority of users might be satisfied by
just changing gitlab to use the patience algorithm when generating
diffs. The idea to use .gitattributes seems to have come from Patrick
rather than a user request.
This is slightly off topic but one thing I'd really like is a way to
tell diff use automatically use --diff-words on some files (e.g.
Documentation/*)
Best Wishes
Phillip
Now, all of those are very slow overall, but some much more than
others. I seem to recall that the non-default ones also had some
pathological cases.
Another thing to think about is that we've so far considered the diff
algorithm to be purely about presentation, with some notable exceptions
such as "patch-id".
I've advocated for us getting to the point of having an in-repo
.gitconfig or .gitattributes before with a whitelist of settings like
diff.context for certain paths, or a diff.orderFile.
But those seem easy to promise future behavior for, v.s. an entire diff
algorithm (which we of course had before, but now we'd have it in
repository data).
Maybe that's not a distinction worth worrying about, just putting that
out there.
I think if others are concerned about the above something that would
neatly side-step those is to have it opt-in via the .git/config somehow,
similar to e.g. how you can commit *.gpg content, put this in
.gitattributes:
*.gpg diff=gpg
But not have it do anything until this is in the repo's .git/config (or
similar):
[diff "gpg"]
textconv = gpg --no-tty --decrypt
For that you could still keep the exact .gitattributes format you have
here, i.e.:
file* diff-algorithm=$STRATEGY
But we to pick it up we'd need either:
[diff-algorithm]
histogram = myers
Or:
[diff-algorithm "histogram"]
allow = true
This would help address slowness from the diff algorithm itself. I'm not opposed
to adding this config if this attack vector is concerning to people.
However, it wouldn't help address the additional latency of scanning
.gitattributes to find the diff algorithm.
Would a separate config to allow gitattributes be helpful here?
[diff-algorithm]
attributes = true
The former form being one that would allow you to map the .gitattributes
of the repo (but maybe that would be redundant to
.git/info/attributes)...