Re: [PATCH v2 5/8] packed-backend: check whether the refname contains NUL characters

shejialuo <shejialuo@xxxxxxxxx> · Wed, 5 Feb 2025 18:09:16 +0800

On Mon, Feb 03, 2025 at 09:40:22AM +0100, Patrick Steinhardt wrote:
> On Thu, Jan 30, 2025 at 12:07:46PM +0800, shejialuo wrote:
> > We have already implemented the header consistency check for the raw
> > "packed-refs" file. Before we implement the consistency check for each
> > ref entry, let's analysis [1] which reports that "git fsck" cannot
> > detect some NUL characters.
> 
> This paragraph doesn't quite parse. I think it can simply be left out,
> as the remainder of the commit message already explains in more than
> enough detail what you're doing.
> 

Let me improve this in the next version.

> > "packed-backend.c::next_record" will use "check_refname_format" to check
> > the consistency of the refname. If it is not OK, the program will die.
> > So, we already have the code path and we must miss out something.
> > 
> > We use the following code to get the refname:
> > 
> >     strbuf_add(&iter->refname_buf, p, eol - p);
> >     iter->base.refname = iter->refname_buf.buf
> > 
> > In the above code, `p` is the start pointer of the refname and `eol` is
> > the next newline pointer. We calculate the length of the refname by
> > subtracting the two pointers. Then we add the memory range between `p`
> > and `eol` to get the refname.
> > 
> > However, if there are some NUL characters in the memory range between `p`
> > and `eol`, we will see the refname as a valid ref name as long as the
> > memory range between `p` and first occurred NUL character is valid.
> > 
> > In order to catch above corruption, create a new function
> > "refname_contains_nul" by searching the first NUL character. If it is
> > not at the end of the string, there must be some NUL characters in the
> > refname.
> > 
> > Use this function in "next_record" function to die the program if
> > "refname_contains_nul" returns true.
> 
> Yeah, makes sense to me. NUL bytes are invalid, and nothing good can
> come out of it.
> 
> > diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> > index 883189f3a1..870c8e7aaa 100644
> > --- a/refs/packed-backend.c
> > +++ b/refs/packed-backend.c
> > @@ -494,6 +494,22 @@ static void verify_buffer_safe(struct snapshot *snapshot)
> >  				 last_line, eof - last_line);
> >  }
> >  
> > +/*
> > + * When parsing the "packed-refs" file, we will parse it line by line.
> > + * Because we know the start pointer of the refname and the next
> > + * newline pointer, we could calculate the length of the refname by
> > + * subtracting the two pointers. However, there is a corner case where
> > + * the refname contains corrupted embedded NUL characters. And
> > + * `check_refname_format()` will not catch this when the truncated
> > + * refname is still a valid refname. To prevent this, we need to check
> > + * whether the refname contains the NUL characters.
> > + */
> > +static int refname_contains_nul(struct strbuf *refname)
> > +{
> > +	const char *pos = memchr(refname->buf, '\0', refname->len + 1);
> > +	return pos < refname->buf + refname->len;
> > +}
> 
> This can be simplified to:
> 
>     return !!memchr(refname->buf, '\0', refname->len);
> 

This is very nice.

> Ideally, we'd be amending `check_refname_format()` to do the checking
> for us. But we can't without a wider refactoring because that function
> gets a C string, and C strings are naturally terminadet by NUL
> characters.
> 

Yes, we cannot. Actually, this is a corner case. NUL character is so
special.

> I think that adding a new function for this is a bit over the top
> though, as the check is unlikely to be useful in a lot of places and the
> logic is rather trivial. So I'd just inline the check into
> `next_record()`.
> 

The reason why I extract this logic into a separate function is that we
will reuse this logic for later packed backend consistency checking. We
nearly use the same way to parse the raw "packed-ref" files. So, I don't
want to repeat here.

I will improve the commit message to add the motivation why we need to
use a function instead of using it in the inline way.

Thanks,
Jialuo

> Patrick