Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation

Taylor Blau <me@xxxxxxxxxxxx> · Fri, 24 Mar 2023 19:08:56 -0400

On Tue, Mar 21, 2023 at 01:56:12PM -0400, Jeff King wrote:
> I like the idea of centralizing the bounds checks here.
>
> I do think copying lseek() is a little awkward, though. As you note, we
> only care about SEEK_SET and SEEK_CUR, and we have to BUG() on anything
> else. Which means we have a run-time check, rather than a compile time
> check. If we had two functions like:
>
>   void bitmap_index_seek_to(struct bitmap_index *bitmap_git, size_t pos)
>   {
> 	bitmap_git->map_pos = pos;
> 	if (bitmap_git->map_pos >= bitmap_git->map_size)
> 	   ...etc...
>   }
>
>   void bitmap_index_incr(struct bitmap_index *bitmap_git, size_t offset)
>   {
> 	bitmap_index_seek_to(bitmap_git, st_add(bitmap_git->map_pos, offset));
>   }
>
> then the compiler can see all cases directly. I dunno how much it
> matters.

Yeah, I think that trying to copy the interface of lseek() was a
mistake, since it ended up creating more awkwardness than anything else.
I like the combination of bitmap_index_seek_to() and _incr(), though
though I called the latter bitmap_index_seek_set() instead.

We've got to be a little reminiscent of lseek() after all ;-).

> The other thing that's interesting from a bounds-checking perspective is
> that you're checking the seek _after_ you've read an item. Which has two
> implications:
>
>   - I don't think we could ever overflow size_t here.

Yep, agreed. Let's drop the st_add() call there entirely, since it's not
doing anything useful and just serving to confuse the reader.

>   - The more interesting case is that we are not overflowing size_t, but
>     simply walking past bitmap_git->map_size. But again, we are reading
>     first and then seeking. So if our seek does go past, then by
>     definition we just read garbage bytes, which is undefined behavior.
>
>     For a bounds-check, wouldn't we want it the other way around? To
>     make sure we have 4 bytes available, and then if so read them and
>     increment the offset?

I thought on first blush that this would be a much larger change, but
actually it's quite small. The EWAH code already checks its reads ahead
of time, so we don't have to worry about those. It's really that
read_be32() and read_u8() do the read-then-seek.

So I think that the change here is limited to making sure that we can
read enough bytes in those functions before actually executing the read.

> > +	if (bitmap_git->map_pos >= bitmap_git->map_size)
> > +		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
> > +		    (uintmax_t)bitmap_git->map_pos,
> > +		    (uintmax_t)bitmap_git->map_size);
>
> This uses ">=", which is good, because it is valid to walk the pointer
> to one past the end of the map, which is effectively EOF. But as above,
> in that condition the callers should be checking for this EOF state
> before reading the bytes.

I think that having both is useful. Checking before you read avoids
undefined behavior. Checking after you seek ensures that we never have a
bitmap_index struct with a bogus map_pos value.

Thanks,
Taylor