Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Nguyễn Thái Ngọc Duy  <pclouds@xxxxxxxxx> writes:

> Too large files may lead to failure to allocate memory. If it happens
> here, it could impact quite a few commands that involve
> diff. Moreover, too large files are inefficient to compare anyway (and
> most likely non-text), so mark them binary and skip looking at their
> content.
> ...
> diff --git a/diff.c b/diff.c
> index a489540..7a977aa 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
>  			one->is_binary = one->driver->binary;
>  		else {
>  			if (!one->data && DIFF_FILE_VALID(one))
> -				diff_populate_filespec(one, 0);
> -			if (one->data)
> +				diff_populate_filespec(one, DIFF_POPULATE_IS_BINARY);
> +			if (one->is_binary == -1 && one->data)
>  				one->is_binary = buffer_is_binary(one->data,
>  						one->size);
>  			if (one->is_binary == -1)

The name is misleading and forced me to read it twice before I
realized that this is "populating the is-binary bit".  It might make
it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
the other bit may want to be also renamed from SIZE_ONLY to either

 (1) CHECK_SIZE_ONLY

 (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
     make SIZE_ONLY the union of two

I do not have strong preference either way; the latter may be more
logical in that "not loading contents" and "check size" are sort of
orthogonal in that you can later choose to check, without loading
contents, only the binary-ness without checking size, but no calles
that passes a non-zero flag to the populate-filespec function will
want to slurp in the contents in practice, so in that sense we could
declare that the NO_CONENTS bit is implied.

But more importantly, would this patch actually help?  For one
thing, this wouldn't (and shouldn't) help if the user wants --binary
diff out of us anyway, I suspect, but I wonder what the following
codepath in the builtin_diff() function would do:

		...
	} else if (!DIFF_OPT_TST(o, TEXT) &&
	    ( (!textconv_one && diff_filespec_is_binary(one)) ||
	      (!textconv_two && diff_filespec_is_binary(two)) )) {
		if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
			die("unable to read files to diff");
		/* Quite common confusing case */
		if (mf1.size == mf2.size &&
		    !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
			if (must_show_header)
				fprintf(o->file, "%s", header.buf);
			goto free_ab_and_return;
		}
		fprintf(o->file, "%s", header.buf);
		strbuf_reset(&header);
		if (DIFF_OPT_TST(o, BINARY))
			emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
		else
			fprintf(o->file, "%sBinary files %s and %s differ\n",
				line_prefix, lbl[0], lbl[1]);
		o->found_changes = 1;
	} else {
		...

If we weren't told with --text/-a to force textual output, and
at least one of the sides is marked as binary (and this patch marks
a large blob as binary unless attributes says otherwise), we still
call fill_mmfile() on them to slurp the contents to be compared, no?

And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
check if the sides are identical, so...

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]