Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

Thomas Braun <thomas.braun@xxxxxxxxxxxxxxxxxxx> · Fri, 27 Jun 2014 20:56:17 +0200




Am 26.06.2014 19:55, schrieb Junio C Hamano:
> Nguyễn Thái Ngọc Duy  <pclouds@xxxxxxxxx> writes:
> 
>> Too large files may lead to failure to allocate memory. If it happens
>> here, it could impact quite a few commands that involve
>> diff. Moreover, too large files are inefficient to compare anyway (and
>> most likely non-text), so mark them binary and skip looking at their
>> content.
>> ...
>> diff --git a/diff.c b/diff.c
>> index a489540..7a977aa 100644
>> --- a/diff.c
>> +++ b/diff.c
>> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
>>  			one->is_binary = one->driver->binary;
>>  		else {
>>  			if (!one->data && DIFF_FILE_VALID(one))
>> -				diff_populate_filespec(one, 0);
>> -			if (one->data)
>> +				diff_populate_filespec(one, DIFF_POPULATE_IS_BINARY);
>> +			if (one->is_binary == -1 && one->data)
>>  				one->is_binary = buffer_is_binary(one->data,
>>  						one->size);
>>  			if (one->is_binary == -1)
> 
> The name is misleading and forced me to read it twice before I
> realized that this is "populating the is-binary bit".  It might make
> it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
> perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
> the other bit may want to be also renamed from SIZE_ONLY to either
> 
>  (1) CHECK_SIZE_ONLY
> 
>  (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
>      make SIZE_ONLY the union of two
> 
> I do not have strong preference either way; the latter may be more
> logical in that "not loading contents" and "check size" are sort of
> orthogonal in that you can later choose to check, without loading
> contents, only the binary-ness without checking size, but no calles
> that passes a non-zero flag to the populate-filespec function will
> want to slurp in the contents in practice, so in that sense we could
> declare that the NO_CONENTS bit is implied.
> 
> But more importantly, would this patch actually help?  For one
> thing, this wouldn't (and shouldn't) help if the user wants --binary
> diff out of us anyway, I suspect, but I wonder what the following
> codepath in the builtin_diff() function would do:
> 
> 		...
> 	} else if (!DIFF_OPT_TST(o, TEXT) &&
> 	    ( (!textconv_one && diff_filespec_is_binary(one)) ||
> 	      (!textconv_two && diff_filespec_is_binary(two)) )) {
> 		if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
> 			die("unable to read files to diff");
> 		/* Quite common confusing case */
> 		if (mf1.size == mf2.size &&
> 		    !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
> 			if (must_show_header)
> 				fprintf(o->file, "%s", header.buf);
> 			goto free_ab_and_return;
> 		}
> 		fprintf(o->file, "%s", header.buf);
> 		strbuf_reset(&header);
> 		if (DIFF_OPT_TST(o, BINARY))
> 			emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
> 		else
> 			fprintf(o->file, "%sBinary files %s and %s differ\n",
> 				line_prefix, lbl[0], lbl[1]);
> 		o->found_changes = 1;
> 	} else {
> 		...
> 
> If we weren't told with --text/-a to force textual output, and
> at least one of the sides is marked as binary (and this patch marks
> a large blob as binary unless attributes says otherwise), we still
> call fill_mmfile() on them to slurp the contents to be compared, no?
> 
> And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
> check if the sides are identical, so...

Good point. So how about an additional change roughly sketched as

@@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct
diff_filespec *one)
 	return userdiff_get_textconv(one->driver);
 }

+/* read the files in small chunks into memory and compare them */
+static int filecmp_chunked(struct diff_filespec *one,
+	struct diff_filespec *two)
+{
+	// TODO add implementation
+	return 0;
+}
+
 static void builtin_diff(const char *name_a,
 			 const char *name_b,
 			 struct diff_filespec *one,
@@ -2325,19 +2333,26 @@ static void builtin_diff(const char *name_a,
 	} else if (!DIFF_OPT_TST(o, TEXT) &&
 	    ( (!textconv_one && diff_filespec_is_binary(one)) ||
 	      (!textconv_two && diff_filespec_is_binary(two)) )) {
-		if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2,two)< 0)
-			die("unable to read files to diff");
+
+		unsigned long size1 = diff_filespec_size(one);
+		unsigned long size2 = diff_filespec_size(two);
+
+		if (size1 == 0 || size2 == 0)
+			die("unable to retrieve file sizes for diff");
 		/* Quite common confusing case */
-		if (mf1.size == mf2.size &&
-		    !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
+		if (size1 == size2 && !filecmp_chunked(one,two)) {
 			if (must_show_header)
 				fprintf(o->file, "%s", header.buf);
 			goto free_ab_and_return;
 		}
 		fprintf(o->file, "%s", header.buf);
 		strbuf_reset(&header);
-		if (DIFF_OPT_TST(o, BINARY))
+		if (DIFF_OPT_TST(o, BINARY)) {
+			if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
+				die("unable to read files to diff");
 			emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
+		}
 		else
 			fprintf(o->file, "%sBinary files %s and %s differ\n",
 				line_prefix, lbl[0], lbl[1]);

Then the default diff case, no BINARY flag, would not read both files into memory.
filecmp_chunked will be slower than file_mmfile and memcmp, but its whole purpose is to read and compare the files in chunks.
The chunk size can be something like 64MiB.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html