Re: [PATCH] blame.c: don't drop origin blobs as eagerly

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Fri, 27 May 2016 17:00:25 +0200 (CEST)

Hi David,

it is good practice to Cc: the original author of the code in question, in
this case Junio. I guess he sees this anyway, but that is really just an
assumption.

On Fri, 27 May 2016, David Kastrup wrote:

> When a parent blob already has chunks queued up for blaming, dropping
> the blob at the end of one blame step will cause it to get reloaded
> right away, doubling the amount of I/O and unpacking when processing a
> linear history.

It is obvious from your commit message that you have studied the code
quite deeply. To make it easier for the reader (which might be your future
self), it is advisable to give at least a little bit of introduction, e.g.
what the "parent blob" is.

I would *guess* that it is the blob corresponding to the same path in the
parent of the current revision, but that should be spelled out explicitly.

> Keeping such parent blobs in memory seems like a reasonable
> optimization.  It's conceivable that this may incur additional memory

This sentence would be easier to read if "It's conceivable that" was
simply deleted.

> pressure particularly when the history contains lots of merges from
> long-diverged branches.  In practice, this optimization appears to
> behave quite benignly,

Why not just stop here? I say that because...

> and a viable strategy for limiting the total amount of cached blobs in a
> useful manner seems rather hard to implement.

... this sounds awfully handwaving. Since we already have reference
counting, it sounds fishy to claim that simply piggybacking a global
counter on top of it would be "rather hard".

> In addition, calling git-blame with -C leads to similar memory retention
> patterns.

This is a red herring. Just delete it. I, for one, being a heavy user of
`git blame`, could count the number of times I used blame's -C option
without any remaining hands. Zero times.

Besides, -C is *supposed* to look harder. By that argument, you could read
all blobs in rev-list even when the user did not specify --objects
"because --objects leads to similar memory retention patterns". So: let's
just forget about that statement.

The commit message is missing your sign-off.

Also: is there an easy way to reproduce your claims of better I/O
characteristics? Something like a command-line, ideally with a file in
git.git's own history, that demonstrates the I/O before and after the
patch, would be an excellent addition to the commit message.

Further: I would have at least expected some rudimentary discussion why
this patch -- which seems to at least partially contradict 7c3c796 (blame:
drop blob data after passing blame to the parent, 2007-12-11) -- is not
regressing on the intent of said commit.

> diff --git a/builtin/blame.c b/builtin/blame.c
> index 21f42b0..2596fbc 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -1556,7 +1556,8 @@ finish:
>  	}
>  	for (i = 0; i < num_sg; i++) {
>  		if (sg_origin[i]) {
> -			drop_origin_blob(sg_origin[i]);
> +			if (!sg_origin[i]->suspects)
> +				drop_origin_blob(sg_origin[i]);
>  			origin_decref(sg_origin[i]);
>  		}

It would be good to mention in the commit message that this patch does not
change anything for blobs with only one remaining reference (the current
one) because origin_decref() would do the same job as drop_origin_blob
when decrementing the reference counter to 0.

In fact, I suspect that simply removing the drop_origin_blob() call might
result in the exact same I/O pattern.

Ciao,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html