Re: [PATCH v3 3/6] log: fix a memory leak in "git show <revision>..."

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 03 Aug 2022 10:54:09 -0700

Ævar Arnfjörð Bjarmason  <avarab@xxxxxxxxx> writes:

> Fix a memory leak in code added in 5d7eeee2ac6 (git-show: grok blobs,
> trees and tags, too, 2006-12-14). As we iterate over a "<revision>..."
> command-line and encounter ad OBJ_COMMIT we want to use our "struct

"... encounter an OBJ_COMMIT, we want to ..."?

> rev_info", but with a "pending" array of one element: the one commit
> we're showing in the loop.
>
> To do this 5d7eeee2ac6 saved away a pointer to rev.pending.objects and
> rev.pending.nr for its iteration. We'd then clobber those (and alloc)
> when we needed to show an OBJ_COMMIT.
>
> We'd therefore leak the "rev.pending" we started out with, and only
> free the new "rev.pending" in the "OBJ_COMMIT" case arm as
> prepare_revision_walk() would draw it down.
>
> Let's fix this memory leak. Now when we encounter an OBJ_COMMIT we
> save away the "rev.pending" before clearing it. We then add a single
> commit to it, which our indirect invocation of prepare_revision_walk()
> will remove. After that we restore the "rev.pending".
>
> Our "rev.pending" will then get free'd by the release_revisions()
> added in f6bfea0ad01 (revisions API users: use release_revisions() in
> builtin/log.c, 2022-04-13)

Hmph, this gives me a strange sense of deja-vu that I saw a better
solution in a separate patch from you, strange because I do not see
anything at the tip of 'seen' like what I thought you did elsewhere.

We do need to reuse "rev_info" we got from outside the loop so that
we will have to consistently apply diffopt and other things we got
in there from the configuration and the command line.  But after we
decide to go to "each object is shown individually" mode, the
contents of the pending array (rather, what we got from the command
line in cmdline member) is only relevant to enumerate which
individual objects are shown in the loop, and should not even be
visible to the code that handles each individual object, e.g. even
we pass &rev to this code when we see a blob:

		switch (o->type) {
		case OBJ_BLOB:
			ret = show_blob_object(&o->oid, &rev, name);
			break;

there is no point in exposing the rev.pending to show_blob_object()
at al.  The same is true for codepaths used to show a tag or a tree.
When showing a commit, we do not even want the codepath that shows a
single-commit range to touch and destroy the original rev.pending;
we instead want to give a single-element pending array.

So instead of keeping rev.pending when we enter the loop and saving
it away and restoring it, it feels a lot cleaner to invent/use an
interface to "clone" the settings in an existing rev_info by:

 * Grab rev.pending into a separate "struct object_array" that is a
   local variable in cmd_show() and clear rev.pending, immediately
   after we decide to go to "show individually" mode.

 * Iterate over the objects in that local variable.  For each object
   to be shown, prepare a rev_info, clone the setting from the
   original rev_info, put that single object to the pending member
   of the clone, use that cloned rev_info, and release the resources
   held by the cloned rev_info once we are done.

> diff --git a/builtin/log.c b/builtin/log.c
> index 88a5e98875a..b4b1d974617 100644
> --- a/builtin/log.c
> +++ b/builtin/log.c
> @@ -743,11 +743,17 @@ int cmd_show(int argc, const char **argv, const char *prefix)
>  			rev.shown_one = 1;
>  			break;
>  		case OBJ_COMMIT:
> +		{
> +			struct object_array old;
> +
> +			memcpy(&old, &rev.pending, sizeof(old));
>  			rev.pending.nr = rev.pending.alloc = 0;
>  			rev.pending.objects = NULL;
>  			add_object_array(o, name, &rev.pending);
>  			ret = cmd_log_walk_no_free(&rev);
> +			memcpy(&rev.pending, &old, sizeof(rev.pending));
>  			break;
> +		}

The reason why I find this approach a bit disturbing is that we
pretend to know for certain that pending is the only thing we need
to protect across multiple calls to the log_walk(), but I suspect
that such an overconfidence will come back and bite us.  We are not
re-initializing other states reachable from the rev_info (e.g. the
diff_options struct has various members that records what happened
in an invocation, like needed_rename_limit, found_follow, and
found_changes, that would want to be reinitialized if we are
starting a new and totally independent traversal after we are done
one traversal).

But I do not mind at all to leave such a fundamental clean-up to a
totally separate topic, and keep this patch only about "we are
clobbering and leaking rev.pending, let's plug the leak".  In fact,
I would prefer it that way.  So take all of the above as material
for possible NEEDSWORK comment, food for further thought.