On Tue, Mar 21, 2023 at 01:56:12PM -0400, Jeff King wrote: > I like the idea of centralizing the bounds checks here. > > I do think copying lseek() is a little awkward, though. As you note, we > only care about SEEK_SET and SEEK_CUR, and we have to BUG() on anything > else. Which means we have a run-time check, rather than a compile time > check. If we had two functions like: > > void bitmap_index_seek_to(struct bitmap_index *bitmap_git, size_t pos) > { > bitmap_git->map_pos = pos; > if (bitmap_git->map_pos >= bitmap_git->map_size) > ...etc... > } > > void bitmap_index_incr(struct bitmap_index *bitmap_git, size_t offset) > { > bitmap_index_seek_to(bitmap_git, st_add(bitmap_git->map_pos, offset)); > } > > then the compiler can see all cases directly. I dunno how much it > matters. Yeah, I think that trying to copy the interface of lseek() was a mistake, since it ended up creating more awkwardness than anything else. I like the combination of bitmap_index_seek_to() and _incr(), though though I called the latter bitmap_index_seek_set() instead. We've got to be a little reminiscent of lseek() after all ;-). > The other thing that's interesting from a bounds-checking perspective is > that you're checking the seek _after_ you've read an item. Which has two > implications: > > - I don't think we could ever overflow size_t here. Yep, agreed. Let's drop the st_add() call there entirely, since it's not doing anything useful and just serving to confuse the reader. > - The more interesting case is that we are not overflowing size_t, but > simply walking past bitmap_git->map_size. But again, we are reading > first and then seeking. So if our seek does go past, then by > definition we just read garbage bytes, which is undefined behavior. > > For a bounds-check, wouldn't we want it the other way around? To > make sure we have 4 bytes available, and then if so read them and > increment the offset? I thought on first blush that this would be a much larger change, but actually it's quite small. The EWAH code already checks its reads ahead of time, so we don't have to worry about those. It's really that read_be32() and read_u8() do the read-then-seek. So I think that the change here is limited to making sure that we can read enough bytes in those functions before actually executing the read. > > + if (bitmap_git->map_pos >= bitmap_git->map_size) > > + BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")", > > + (uintmax_t)bitmap_git->map_pos, > > + (uintmax_t)bitmap_git->map_size); > > This uses ">=", which is good, because it is valid to walk the pointer > to one past the end of the map, which is effectively EOF. But as above, > in that condition the callers should be checking for this EOF state > before reading the bytes. I think that having both is useful. Checking before you read avoids undefined behavior. Checking after you seek ensures that we never have a bitmap_index struct with a bogus map_pos value. Thanks, Taylor