Re: [PATCH v2] diff: Fix modified lines stats with --stat and --numstat

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Taylor Blau <me@xxxxxxxxxxxx> writes:

> On Sun, Sep 20, 2020 at 09:09:46AM -0400, Thomas Guyot-Sionnest wrote:
>> In builtin_diffstat(), when both files are coming from "stdin" (which
>> could be better described as the file's content being written directly
>> into the file object), oideq() compares two null hashes and ignores the
>> actual differences for the statistics.
>>
>> This patch checks if is_stdin flag is set on both sides and compare
>> contents directly.
>>
>> Signed-off-by: Thomas Guyot-Sionnest <tguyot@xxxxxxxxx>
>> ---
>> Range-diff:
>> 1:  479c2835fc ! 1:  1f25713d44 diff: Fix modified lines stats with --stat and --numstat
>>     @@ -20,8 +20,12 @@
>>       	}
>>
>>      -	same_contents = oideq(&one->oid, &two->oid);
>>     ++	/* What is_stdin really means is that the file's content is only
>>     ++	 * in the filespec's buffer and its oid is zero. We can't compare
>>     ++	 * oid's if both are null and we can just diff the buffers */
>>      +	if (one->is_stdin && two->is_stdin)
>>     -+		same_contents = !strcmp(one->data, two->data);
>>     ++		same_contents = (one->size == two->size ?
>>     ++			!memcmp(one->data, two->data, one->size) : 0);
>>      +	else
>>      +		same_contents = oideq(&one->oid, &two->oid);
>
> After reading your explanation in [1], this version makes more sense to
> me.

These oid fields are prepared by calling diff_fill_oid_info(), and
even for paths that are dirty (hence no "free" oid available from
index or tree entry), an appropriate oid is computed.

But there are paths for which oid cannot be computed without
destroying their contents.  Such paths are marked by the function
with null_oid.

It happens to be that stdin is the only class of paths that are
treated that way _right_ _now_, but future code may support
different kind of paths that share the same trait.

When we want to know "is comparing the oid sufficient?", we
shouldn't inspect the is_stdin flag ourselves in a caller of
diff_fill_oid_info(), because the helper _is_ responsible for
knowing what kind of paths are special, and signals that "assume
this would not be equal to anything else" by giving null_oid back.

The caller should use the info left by diff_fill_oid_info(), namely,
"even if the oid on both sides are the same, if it is null_oid, then
we know diff_fill_oid_info() didn't actually compute the oid, and we
need to compare the blob ourselves".

And there is no point in doing memcmp() here, I think.  

The same_contents() check is done as an optimization to avoid xdl.
Even if the two sides were thought to be different at the oid level,
xdl comparison may find that there is no difference after all
(e.g. think of whitespace ignoring comparison), so we should assume
and rely on that the downstream code MUST BE prepared to handle
false negatives (i.e. same_contents says they are different, but
they actually produce no diffstat).  Running memcmp() over contents
in potentially a large buffer to find that they are different, and
then have xdl process that large buffer again, would be a waste.

Summarizing the above, I think the second best fix is this (which
means that the posted patch is the third best):

	/*
	 * diff_fill_oid_info() marked one/two->oid with null_oid
	 * for a path whose oid is not available.  Disable early
	 * return optimization for them.
	 */
	if (oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
		same_contents = 0; /* could be different */
	else if (oideq(&one->oid, &two->oid))
		same_contents = 1; /* definitely the same */
	else
		same_contents = 0; /* definitely different */

But I suspect that the best fix is to teach diff_fill_oid_info() to
hash the in-memory data to compute the oid, instead of punting and
filling the oid field with null_oid.  If function builtin_diffstat()
is allowed to look at the contents and run memcmp() here, the 'data'
field should have been filled and valid when diff_fill_oid_info()
looked at it already.

The "best" fix will have wider consequences, so we may not want to
jump to it right away without careful consideration.

For example, the "best" fix will fix another bug.  The 'index'
header shows a NULL object name in normal "diff --patch" output for
these paths in the current code, which means they cannot be used
with "apply --3way".  That way, this codepath does not have to know
anything about the "null means we don't know" convention.  

Try:

    $ (cat COPYING; echo) >RENAMING
    $ git diff --no-index COPYING - <RENAMING | grep '^index '
    index 536e55524d..0000000000 100644

and notice that the stdin side has a null object name in the current
code.  I think we will show the right object name if we fix the
diff_fill_oid_info().

Thanks.





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux