Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Ævar,

On 6 Feb 2023, at 11:39, Ævar Arnfjörð Bjarmason wrote:

> On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
>
>> From: John Cai <johncai86@xxxxxxxxx>
>> [...]
>> +
>> +		if (!o->xdl_opts_command_line) {
>> +			static struct attr_check *check;
>> +			const char *one_diff_algo;
>> +			const char *two_diff_algo;
>> +
>> +			check = attr_check_alloc();
>> +			attr_check_append(check, git_attr("diff-algorithm"));
>> +
>> +			git_check_attr(the_repository->index, NULL, one->path, check);
>> +			one_diff_algo = check->items[0].value;
>> +			git_check_attr(the_repository->index, NULL, two->path, check);
>> +			two_diff_algo = check->items[0].value;
>> +
>> +			if (!ATTR_UNSET(one_diff_algo) && !ATTR_UNSET(two_diff_algo) &&
>> +				!strcmp(one_diff_algo, two_diff_algo))
>> +				set_diff_algorithm(o, one_diff_algo);
>> +
>> +			attr_check_free(check);
>
> This is a bit nitpicky, but I for one would find this much easier to
> read with some shorter variables, here just with "a" rather than
> "one_diff_algo", "b" instead of "two_diff_algo", and splitting
> "the_repository->index" into "istate" (untested):
> 	
> 	+		if (!o->xdl_opts_command_line) {
> 	+			static struct attr_check *check;
> 	+			const char *a;
> 	+			const char *b;
> 	+			struct index_state *istate = the_repository->index;
> 	+
> 	+			check = attr_check_alloc();
> 	+			attr_check_append(check, git_attr("diff-algorithm"));
> 	+
> 	+			git_check_attr(istate, NULL, one->path, check);
> 	+			a = check->items[0].value;
> 	+			git_check_attr(istate, NULL, two->path, check);
> 	+			b = check->items[0].value;
> 	+
> 	+			if (!ATTR_UNSET(a) && !ATTR_UNSET(b) && !strcmp(a, b))
> 	+				set_diff_algorithm(o, a);
> 	+
> 	+			attr_check_free(check);
> 	+		}
>
> That also nicely keeps the line length shorter.

Thanks, I think this does look better.

>
>> @@ -333,6 +333,8 @@ struct diff_options {
>>  	int prefix_length;
>>  	const char *stat_sep;
>>  	int xdl_opts;
>> +	/* If xdl_opts has been set via the command line. */
>> +	int xdl_opts_command_line;
>>
>>  	/* see Documentation/diff-options.txt */
>>  	char **anchors;
>> diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
>> index 8d1e408bb58..630c98ea65a 100644
>> --- a/t/lib-diff-alternative.sh
>> +++ b/t/lib-diff-alternative.sh
>> @@ -107,8 +107,27 @@ EOF
>>
>>  	STRATEGY=$1
>>
>> +	test_expect_success "$STRATEGY diff from attributes" '
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index file1 file2 > output &&
>> +		test_cmp expect output
>> +	'
>> +
>>  	test_expect_success "$STRATEGY diff" '
>> -		test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>
> Nit: The usual style is ">output", not "> output".

noted!

>
>> +		test_cmp expect output
>> +	'
>> +
>> +	test_expect_success "$STRATEGY diff command line precedence before attributes" '
>> +		echo "file* diff-algorithm=meyers" >.gitattributes &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> +		test_cmp expect output
>> +	'
>> +
>> +	test_expect_success "$STRATEGY diff attributes precedence before config" '
>> +		git config diff.algorithm default &&
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>>  		test_cmp expect output
>>  	'
>>
>> @@ -166,5 +185,11 @@ EOF
>>  		test_must_fail git diff --no-index "--$STRATEGY" uniq1 uniq2 > output &&
>>  		test_cmp expect output
>>  	'
>> +
>> +	test_expect_success "$STRATEGY diff from attributes" '
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index uniq1 uniq2 > output &&
>> +		test_cmp expect output
>> +	'
>>  }
>
> For some non-nitpicking, I do worry about exposing this as a DoS vector,
> e.g. here's a diff between two distant points in git.git with the
> various algorithms:
>
> 	$ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
> 	Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
> 	  Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
> 	
> 	Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
> 	  Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
> 	
> 	Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
> 	  Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
> 	
> 	Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
> 	  Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
> 	
> 	Summary
> 	  'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
> 	    1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
> 	    1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
> 	    1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'

Thanks for this analysis. To clarify, .gitconfig's diff.algorithm setting is
already an attack vector right? I see how this would be adding another one.

That being said, here's a separate issue. I benchmarked the usage of
.gitattributes as introduced in this patch series, and indeed it does look like
there is additional latency:

$ echo "* diff-algorithm=patience >> .gitattributes
$ hyperfine -r 5 'git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0'                      ✭
Benchmark 1: git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0
  Time (mean ± σ):     889.4 ms ± 113.8 ms    [User: 715.7 ms, System: 65.3 ms]
  Range (min … max):   764.1 ms … 1029.3 ms    5 runs

$ hyperfine -r 5 'git-bin-wrapper diff v2.0.0 v2.28.0'                                                ✭
Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0
  Time (mean ± σ):      2.146 s ±  0.368 s    [User: 0.827 s, System: 0.243 s]
  Range (min … max):    1.883 s …  2.795 s    5 runs

and I imagine the latency scales with the size of .gitattributes. Although I'm
not familiar with other parts of the codebase and how it deals with the latency
introduced by reading attributes files.

>
> Now, all of those are very slow overall, but some much more than
> others. I seem to recall that the non-default ones also had some
> pathological cases.
>
> Another thing to think about is that we've so far considered the diff
> algorithm to be purely about presentation, with some notable exceptions
> such as "patch-id".
>
> I've advocated for us getting to the point of having an in-repo
> .gitconfig or .gitattributes before with a whitelist of settings like
> diff.context for certain paths, or a diff.orderFile.
>
> But those seem easy to promise future behavior for, v.s. an entire diff
> algorithm (which we of course had before, but now we'd have it in
> repository data).
>
> Maybe that's not a distinction worth worrying about, just putting that
> out there.
>
> I think if others are concerned about the above something that would
> neatly side-step those is to have it opt-in via the .git/config somehow,
> similar to e.g. how you can commit *.gpg content, put this in
> .gitattributes:
>
> 	*.gpg diff=gpg
>
> But not have it do anything until this is in the repo's .git/config (or
> similar):
>
> 	[diff "gpg"]
>         	textconv = gpg --no-tty --decrypt
>
> For that you could still keep the exact .gitattributes format you have
> here, i.e.:
>
> 	file* diff-algorithm=$STRATEGY
>
> But we to pick it up we'd need either:
>
> 	[diff-algorithm]
>         	histogram = myers
>
> Or:
>
> 	[diff-algorithm "histogram"]
>         	allow = true

This would help address slowness from the diff algorithm itself. I'm not opposed
to adding this config if this attack vector is concerning to people.

However, it wouldn't help address the additional latency of scanning
.gitattributes to find the diff algorithm.

Would a separate config to allow gitattributes be helpful here?

[diff-algorithm]
	attributes = true

>
> The former form being one that would allow you to map the .gitattributes
> of the repo (but maybe that would be redundant to
> .git/info/attributes)...




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux