Am 26.06.2014 19:55, schrieb Junio C Hamano: > Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes: > >> Too large files may lead to failure to allocate memory. If it happens >> here, it could impact quite a few commands that involve >> diff. Moreover, too large files are inefficient to compare anyway (and >> most likely non-text), so mark them binary and skip looking at their >> content. >> ... >> diff --git a/diff.c b/diff.c >> index a489540..7a977aa 100644 >> --- a/diff.c >> +++ b/diff.c >> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one) >> one->is_binary = one->driver->binary; >> else { >> if (!one->data && DIFF_FILE_VALID(one)) >> - diff_populate_filespec(one, 0); >> - if (one->data) >> + diff_populate_filespec(one, DIFF_POPULATE_IS_BINARY); >> + if (one->is_binary == -1 && one->data) >> one->is_binary = buffer_is_binary(one->data, >> one->size); >> if (one->is_binary == -1) > > The name is misleading and forced me to read it twice before I > realized that this is "populating the is-binary bit". It might make > it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or > perhaps DIFF_POPULATE_CHECK_BINARY or something. For consistency, > the other bit may want to be also renamed from SIZE_ONLY to either > > (1) CHECK_SIZE_ONLY > > (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally > make SIZE_ONLY the union of two > > I do not have strong preference either way; the latter may be more > logical in that "not loading contents" and "check size" are sort of > orthogonal in that you can later choose to check, without loading > contents, only the binary-ness without checking size, but no calles > that passes a non-zero flag to the populate-filespec function will > want to slurp in the contents in practice, so in that sense we could > declare that the NO_CONENTS bit is implied. > > But more importantly, would this patch actually help? For one > thing, this wouldn't (and shouldn't) help if the user wants --binary > diff out of us anyway, I suspect, but I wonder what the following > codepath in the builtin_diff() function would do: > > ... > } else if (!DIFF_OPT_TST(o, TEXT) && > ( (!textconv_one && diff_filespec_is_binary(one)) || > (!textconv_two && diff_filespec_is_binary(two)) )) { > if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0) > die("unable to read files to diff"); > /* Quite common confusing case */ > if (mf1.size == mf2.size && > !memcmp(mf1.ptr, mf2.ptr, mf1.size)) { > if (must_show_header) > fprintf(o->file, "%s", header.buf); > goto free_ab_and_return; > } > fprintf(o->file, "%s", header.buf); > strbuf_reset(&header); > if (DIFF_OPT_TST(o, BINARY)) > emit_binary_diff(o->file, &mf1, &mf2, line_prefix); > else > fprintf(o->file, "%sBinary files %s and %s differ\n", > line_prefix, lbl[0], lbl[1]); > o->found_changes = 1; > } else { > ... > > If we weren't told with --text/-a to force textual output, and > at least one of the sides is marked as binary (and this patch marks > a large blob as binary unless attributes says otherwise), we still > call fill_mmfile() on them to slurp the contents to be compared, no? > > And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to > check if the sides are identical, so... Good point. So how about an additional change roughly sketched as @@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct diff_filespec *one) return userdiff_get_textconv(one->driver); } +/* read the files in small chunks into memory and compare them */ +static int filecmp_chunked(struct diff_filespec *one, + struct diff_filespec *two) +{ + // TODO add implementation + return 0; +} + static void builtin_diff(const char *name_a, const char *name_b, struct diff_filespec *one, @@ -2325,19 +2333,26 @@ static void builtin_diff(const char *name_a, } else if (!DIFF_OPT_TST(o, TEXT) && ( (!textconv_one && diff_filespec_is_binary(one)) || (!textconv_two && diff_filespec_is_binary(two)) )) { - if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2,two)< 0) - die("unable to read files to diff"); + + unsigned long size1 = diff_filespec_size(one); + unsigned long size2 = diff_filespec_size(two); + + if (size1 == 0 || size2 == 0) + die("unable to retrieve file sizes for diff"); /* Quite common confusing case */ - if (mf1.size == mf2.size && - !memcmp(mf1.ptr, mf2.ptr, mf1.size)) { + if (size1 == size2 && !filecmp_chunked(one,two)) { if (must_show_header) fprintf(o->file, "%s", header.buf); goto free_ab_and_return; } fprintf(o->file, "%s", header.buf); strbuf_reset(&header); - if (DIFF_OPT_TST(o, BINARY)) + if (DIFF_OPT_TST(o, BINARY)) { + if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0) + die("unable to read files to diff"); emit_binary_diff(o->file, &mf1, &mf2, line_prefix); + } else fprintf(o->file, "%sBinary files %s and %s differ\n", line_prefix, lbl[0], lbl[1]); Then the default diff case, no BINARY flag, would not read both files into memory. filecmp_chunked will be slower than file_mmfile and memcmp, but its whole purpose is to read and compare the files in chunks. The chunk size can be something like 64MiB. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html