Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid

Jeff King <peff@xxxxxxxx> · Sat, 18 Dec 2010 11:13:37 -0500

On Sat, Dec 18, 2010 at 05:54:12PM +0300, Kirill Smelkov wrote:

> It turned out, under blame there are requests to fill_textconv() with
> sha1=0000000000000000000000000000000000000000 and sha1_valid=0.
> 
> As the code did not analyzed sha1 validity, we ended up putting 000000
> into textconv cache which was fooling later blames to discover lots of
> lines in 'Not Yet Committed' state.
> [...]
> -	if (driver->textconv_cache) {
> +	if (driver->textconv_cache && df->sha1_valid) {
>  		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
>  					  &size);

In short:

  Acked-by: Jeff King <peff@xxxxxxxx>

But it took some thinking to convince myself, so the long answer is
below if anyone cares.

I was dubious at first that this could be the right solution. We still
end up putting the filespec through run_textconv, which didn't seem
right if it is not valid.

But reading into it more, there are two levels of invalidity:

  1. !DIFF_FILE_VALID(df) - we are not a valid file at all. I.e., we are
     /dev/null.

  2. !df->sha1_valid - we are pointing to a working tree file whose sha1
     we don't know

I think level (2) never happens at all in the regular diff code, which
is why this case was completely unhandled. But it is OK in that case
(required, even) to put the contents through run_textconv.

In theory we could actually calculate the sha1 in case (2) and cache
under that, but I don't know how much it would buy us in practice. It
saves us running the textconv filter at the expense of computing the
sha1. Which is probably a win for most filters, but on the other hand,
it is the wrong place to compute such a sha1. If it is a working tree
file, we should ideally update our stat info in the index so that the
info can be reused.

-Peff

PS It is a little disturbing that in fill_textconv, we handle
case (1), !DIFF_FILE_VALID for the non-textconv case, but not so for the
textconv case. I think we are OK, as get_textconv will never load a
textconv driver for a !DIFF_FILE_VALID filespec, so we always follow the
non-textconv codepath in that case. But I am tempted to do this just to
be more defensive:

diff --git a/diff.c b/diff.c
index b0ee213..5320849 100644
--- a/diff.c
+++ b/diff.c
@@ -4404,22 +4404,25 @@ size_t fill_textconv(struct userdiff_driver *driver,
 	if (!driver || !driver->textconv) {
 		if (!DIFF_FILE_VALID(df)) {
 			*outbuf = "";
 			return 0;
 		}
 		if (diff_populate_filespec(df, 0))
 			die("unable to read files to diff");
 		*outbuf = df->data;
 		return df->size;
 	}
 
+	if (!DIFF_FILE_VALID(df))
+		die("BUG: attempt to textconv an invalid filespec");
+
 	if (driver->textconv_cache) {
 		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
 					  &size);
 		if (*outbuf)
 			return size;
 	}
 
 	*outbuf = run_textconv(driver->textconv, df, &size);
 	if (!*outbuf)
 		die("unable to read files to diff");
 
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html