Re: [PATCH 07/28] commit: avoid leaking already-saved buffer

Patrick Steinhardt <ps@xxxxxx> · Thu, 26 Sep 2024 15:50:05 +0200

On Tue, Sep 24, 2024 at 05:54:34PM -0400, Jeff King wrote:
> When we parse a commit via repo_parse_commit_internal(), if
> save_commit_buffer is set we'll stuff the buffer of the object contents
> into a cache, overwriting any previous value.

Interesting. I saw some cases that I think could be caused by this, but
couldn't make much sense of them.

> This can result in a leak of that previously cached value, though it's
> rare in practice. If we have a value in the cache it would have come
> from a previous parse, and during that parse we'd set the object.parsed
> flag, causing any subsequent parse attempts to exit without doing any
> work.
> 
> But it's possible to "unparse" a commit, which we do when registering a
> commit graft. And since shallow fetches are implemented using grafts,
> the leak is triggered in practice by t5539.
> 
> There are a number of possible ways to address this:
> 
>   1. the unparsing function could clear the cached commit buffer, too. I

s/the/The/

>      think this would work for the case I found, but I'm not sure if
>      there are other ways to end up in the same state (an unparsed
>      commit with an entry in the commit buffer cache).
> 
>   2. when we parse, we could check the buffer cache and prefer it to

s/when/When/

>      reading the contents from the object database. In theory the
>      contents of a particular sha1 are immutable, but the code in
>      question is violating the immutability with grafts. So this
>      approach makes me a bit nervous, although I think it would work in
>      practice (the grafts are applied to what we parse, but we still
>      retain the original contents).
> 
>   3. We could realize the cache is already populated and discard its
>      contents before overwriting. It's possible some other code could be
>      holding on to a pointer to the old cache entry (and we'd introduce
>      a use-after-free), but I think the risk of that is relatively low.
> 
>   4. The reverse of (3): when the cache is populated, don't bother
>      saving our new copy. This is perhaps a little weird, since we'll
>      have just populated the commit struct based on a different buffer.
>      But the two buffers should be the same, even in the presence of
>      grafts (as in (2) above).
> 
> I went with option 4. It addresses the leak directly and doesn't carry
> any risk of breaking other assumptions. And it's the same technique used
> by parse_object_buffer() for this situation, though I'm not sure when it
> would even come up there. The extra safety has been there since
> bd1e17e245 (Make "parse_object()" also fill in commit message buffer
> data., 2005-05-25).

Okay. This feels a bit weird indeed, but the fact that we already use
the same strategy in other places makes me feel way safer.

> This lets us mark t5539 as leak-free.
> 
> Signed-off-by: Jeff King <peff@xxxxxxxx>
> ---
>  commit.c                      | 3 ++-
>  t/t5539-fetch-http-shallow.sh | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/commit.c b/commit.c
> index 3a54e4db0d..cc03a93036 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -595,7 +595,8 @@ int repo_parse_commit_internal(struct repository *r,
>  	}
>  
>  	ret = parse_commit_buffer(r, item, buffer, size, 0);
> -	if (save_commit_buffer && !ret) {
> +	if (save_commit_buffer && !ret &&
> +	    !get_cached_commit_buffer(r, item, NULL)) {
>  		set_commit_buffer(r, item, buffer, size);
>  		return 0;
>  	}

And the fix is trivial.

Patrick