Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Feb 08 2023, Elijah Newren wrote:

> On Mon, Feb 6, 2023 at 9:24 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>>
>> On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
>>
>> > From: John Cai <johncai86@xxxxxxxxx>
>> > [...]
>> > +
>> > +             if (!o->xdl_opts_command_line) {
>> > +                     static struct attr_check *check;
>> > +                     const char *one_diff_algo;
>> > +                     const char *two_diff_algo;
>> > +
>> > +                     check = attr_check_alloc();
>> > +                     attr_check_append(check, git_attr("diff-algorithm"));
>> > +
>> > +                     git_check_attr(the_repository->index, NULL, one->path, check);
>> > +                     one_diff_algo = check->items[0].value;
>> > +                     git_check_attr(the_repository->index, NULL, two->path, check);
>> > +                     two_diff_algo = check->items[0].value;
>> > +
>> > +                     if (!ATTR_UNSET(one_diff_algo) && !ATTR_UNSET(two_diff_algo) &&
>> > +                             !strcmp(one_diff_algo, two_diff_algo))
>> > +                             set_diff_algorithm(o, one_diff_algo);
>> > +
>> > +                     attr_check_free(check);
>>
>> This is a bit nitpicky, but I for one would find this much easier to
>> read with some shorter variables, here just with "a" rather than
>> "one_diff_algo", "b" instead of "two_diff_algo", and splitting
>> "the_repository->index" into "istate" (untested):
>>
>>         +               if (!o->xdl_opts_command_line) {
>>         +                       static struct attr_check *check;
>>         +                       const char *a;
>>         +                       const char *b;
>>         +                       struct index_state *istate = the_repository->index;
>>         +
>>         +                       check = attr_check_alloc();
>>         +                       attr_check_append(check, git_attr("diff-algorithm"));
>>         +
>>         +                       git_check_attr(istate, NULL, one->path, check);
>>         +                       a = check->items[0].value;
>>         +                       git_check_attr(istate, NULL, two->path, check);
>>         +                       b = check->items[0].value;
>>         +
>>         +                       if (!ATTR_UNSET(a) && !ATTR_UNSET(b) && !strcmp(a, b))
>>         +                               set_diff_algorithm(o, a);
>>         +
>>         +                       attr_check_free(check);
>>         +               }
>>
>> That also nicely keeps the line length shorter.
>>
>> > @@ -333,6 +333,8 @@ struct diff_options {
>> >       int prefix_length;
>> >       const char *stat_sep;
>> >       int xdl_opts;
>> > +     /* If xdl_opts has been set via the command line. */
>> > +     int xdl_opts_command_line;
>> >
>> >       /* see Documentation/diff-options.txt */
>> >       char **anchors;
>> > diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
>> > index 8d1e408bb58..630c98ea65a 100644
>> > --- a/t/lib-diff-alternative.sh
>> > +++ b/t/lib-diff-alternative.sh
>> > @@ -107,8 +107,27 @@ EOF
>> >
>> >       STRATEGY=$1
>> >
>> > +     test_expect_success "$STRATEGY diff from attributes" '
>> > +             echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> > +             test_must_fail git diff --no-index file1 file2 > output &&
>> > +             test_cmp expect output
>> > +     '
>> > +
>> >       test_expect_success "$STRATEGY diff" '
>> > -             test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
>> > +             test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>>
>> Nit: The usual style is ">output", not "> output".
>>
>> > +             test_cmp expect output
>> > +     '
>> > +
>> > +     test_expect_success "$STRATEGY diff command line precedence before attributes" '
>> > +             echo "file* diff-algorithm=meyers" >.gitattributes &&
>> > +             test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> > +             test_cmp expect output
>> > +     '
>> > +
>> > +     test_expect_success "$STRATEGY diff attributes precedence before config" '
>> > +             git config diff.algorithm default &&
>> > +             echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> > +             test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> >               test_cmp expect output
>> >       '
>> >
>> > @@ -166,5 +185,11 @@ EOF
>> >               test_must_fail git diff --no-index "--$STRATEGY" uniq1 uniq2 > output &&
>> >               test_cmp expect output
>> >       '
>> > +
>> > +     test_expect_success "$STRATEGY diff from attributes" '
>> > +             echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> > +             test_must_fail git diff --no-index uniq1 uniq2 > output &&
>> > +             test_cmp expect output
>> > +     '
>> >  }
>>
>> For some non-nitpicking, I do worry about exposing this as a DoS vector,
>> e.g. here's a diff between two distant points in git.git with the
>> various algorithms:
>>
>>         $ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
>>         Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
>>           Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
>>
>>         Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
>>           Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
>>
>>         Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
>>           Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
>>
>>         Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
>>           Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
>>
>>         Summary
>>           'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
>>             1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
>>             1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
>>             1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'

[snip around]

> If you run with more than 1 run, are your numbers even repeatable?

Yes, but tl;dr it's diff.colorMoved=true, sorry, see below.

> I'm really surprised by these numbers; they aren't remotely close to
> what I compute.  Am I correct in understanding you ran these in
> git.git?  Was your computer overloaded?  Was your git.git in some
> serious need of repacking?  Was it on a network filesystem?  

Just on the box I regularly hack on, which isn't too overloaded, but I
re-did these a bit more seriously.

This is/was on a Hetzner EX41S
(https://docs.hetzner.com/robot/dedicated-server/general-information/root-server-hardware/),
but I tried it again now in /dev/shm/git.git with a fresh repo from:

	git clone --bare git@xxxxxxxxxx:git/git.git

And (I didn't prune out the notice here where it says it's fuzzy (not
earlier either)).

There's some Java thing eating ~50% of 1/8 CPU cores, but otherwise the
box is pretty idle, and this is current "master" with "make CFLAGS=-O3":
	
	$ hyperfine -w 2 -r 10 -L a patience,minimal,histogram,myers './git -C /dev/shm/git.git diff --diff-algorithm={a} v2.0.0 v2.28.0'
	Benchmark 1: ./git -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0
	  Time (mean ± σ):     41.131 s ±  0.550 s    [User: 40.990 s, System: 0.104 s]
	  Range (min … max):   40.323 s … 42.172 s    10 runs
	
	Benchmark 2: ./git -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0
	  Time (mean ± σ):     34.821 s ±  0.307 s    [User: 34.707 s, System: 0.100 s]
	  Range (min … max):   34.512 s … 35.523 s    10 runs
	
	Benchmark 3: ./git -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0
	  Time (mean ± σ):     45.443 s ±  0.274 s    [User: 45.328 s, System: 0.107 s]
	  Range (min … max):   44.932 s … 45.810 s    10 runs
	
	Benchmark 4: ./git -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0
	  Time (mean ± σ):     33.016 s ±  0.505 s    [User: 32.893 s, System: 0.094 s]
	  Range (min … max):   32.376 s … 33.999 s    10 runs
	
	Summary
	  './git -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
	    1.05 ± 0.02 times faster than './git -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
	    1.25 ± 0.03 times faster than './git -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0'
	    1.38 ± 0.02 times faster than './git -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0'

So that's pretty much the same as my earlier results, but between the
fresh repo, ram disk, warmup & 10 measuremnets for each these should be
more accurate.

> Using git compiled from current main, I see:
>
> $ hyperfine -L a patience,minimal,histogram,myers './git diff
> --diff-algorithm={a} v2.0.0 v2.28.0'
> Benchmark 1: ./git diff --diff-algorithm=patience v2.0.0 v2.28.0
>   Time (mean ± σ):      1.142 s ±  0.033 s    [User: 1.022 s, System: 0.114 s]
>   Range (min … max):    1.117 s …  1.212 s    10 runs
>
> Benchmark 2: ./git diff --diff-algorithm=minimal v2.0.0 v2.28.0
>   Time (mean ± σ):      1.959 s ±  0.011 s    [User: 1.830 s, System: 0.120 s]
>   Range (min … max):    1.947 s …  1.976 s    10 runs
>
> Benchmark 3: ./git diff --diff-algorithm=histogram v2.0.0 v2.28.0
>   Time (mean ± σ):      1.187 s ±  0.007 s    [User: 1.065 s, System: 0.115 s]
>   Range (min … max):    1.175 s …  1.200 s    10 runs
>
> Benchmark 4: ./git diff --diff-algorithm=myers v2.0.0 v2.28.0
>   Time (mean ± σ):      1.194 s ±  0.007 s    [User: 1.068 s, System: 0.120 s]
>   Range (min … max):    1.184 s …  1.206 s    10 runs
>
> Summary
>   './git diff --diff-algorithm=patience v2.0.0 v2.28.0' ran
>     1.04 ± 0.03 times faster than './git diff
> --diff-algorithm=histogram v2.0.0 v2.28.0'
>     1.05 ± 0.03 times faster than './git diff --diff-algorithm=myers
> v2.0.0 v2.28.0'
>     1.71 ± 0.05 times faster than './git diff --diff-algorithm=minimal
> v2.0.0 v2.28.0'

But without diff.colorMoved=true I see basically your results:
	
	$ hyperfine -w 2 -r 10 -L a patience,minimal,histogram,myers './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm={a} v2.0.0 v2.28.0'
	Benchmark 1: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0
	  Time (mean ± σ):     760.9 ms ±  45.2 ms    [User: 698.6 ms, System: 62.1 ms]
	  Range (min … max):   719.0 ms … 862.2 ms    10 runs
	 
	Benchmark 2: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0
	  Time (mean ± σ):      1.347 s ±  0.041 s    [User: 1.281 s, System: 0.065 s]
	  Range (min … max):    1.305 s …  1.417 s    10 runs
	 
	Benchmark 3: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0
	  Time (mean ± σ):     826.3 ms ±  51.1 ms    [User: 767.5 ms, System: 58.6 ms]
	  Range (min … max):   773.7 ms … 929.8 ms    10 runs
	 
	Benchmark 4: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0
	  Time (mean ± σ):     801.1 ms ±  39.4 ms    [User: 736.0 ms, System: 64.9 ms]
	  Range (min … max):   771.6 ms … 904.2 ms    10 runs
	 
	Summary
	  './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0' ran
	    1.05 ± 0.08 times faster than './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0'
	    1.09 ± 0.09 times faster than './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0'
	    1.77 ± 0.12 times faster than './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0'

So they're all within the fuzz-factor, except "minimal".

> And this is on a kind-of low-end refurbished laptop from a few years
> ago (although the repo was recently gc'ed).
>
> I'm biased towards histogram (and making it the default rather than
> making it configurable per file), but that's probably obvious given
> that I made ort use it unconditionally.  And when I made ort use it,
> it was actually a minor performance penalty (~1% IIRC)[*], but I
> thought it was worth it since (a) histogram diffs are more
> understandable to users in general, (b) the histogram diff data
> structures provide an idea for possibly solving some ugly corner cases
> that I don't see a way to solve with the other diffs.

To bring this all home the thing I was going for upthread was to raise
"is this a concern?" I was inclined to think "no", but didn't
know. Since someone who knows way more about the diffs than I probably
ever will (i.e. you :) isn't waiving their hands in panic here I think
we can just consider this checkbox ticked.

I.e. I've seen cases (and this is from vague recollection, I've got no
examples in front of me) where one diff algorithm is worse than others
on performance.

But if your intent is to DoS a service provider you can also just commit
some very long files or whatever, so whether the user can change the
diff algorithm is probably not that interesting, as long as the
performance is within some reasonable bound.




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux