Jeff King <peff@xxxxxxxx> writes: > On Mon, Jan 15, 2024 at 02:31:12PM -0800, Linus Arver wrote: > >> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start, >> > + size_t chunk_size, >> > + void *data) >> > +{ >> > + struct pair_chunk_data *pcd = data; >> > + if (chunk_size / pcd->record_size != pcd->record_nr) >> > + return -1; >> > + *pcd->p = chunk_start; >> > + return 0; >> > +} >> > + >> >> I don't think this function should assume anything about the inputs >> (chunk_size, record_size, nor record_nr). If we are asking the helper to >> be the common tool for multiple locations, it should assume even less >> about what these inputs could look like. >> >> For example, if record_size is 0 this will lead to a >> divide-by-zero. Likewise, if record_nr is zero it doesn't make >> sense to check if chunk_size fits into record_size * record_nr. > > I'm not sure that divide-by-zero is a big deal, because 0-length > fixed-size records do not make any sense to ask about. So why not make this function check for this? While it may be true that 0-length fixed-size records are impossible currently, nothing guarantees they will always be that way all the time in the future. > And even if > somebody accidentally passed 0, even though it won't be caught by the > compiler, it would barf on any input, so even rudimentary testing would > catch it. If somebody is accidentally passing an invalid value to a function, then it feels right for that function to be able to handle it instead of crashing (or doing any other undefined behavior) with division-by-zero. Taking a step back though, maybe I am being overly defensive about possible failure modes. I don't know the surrounding area well so I might be overreacting. > A zero record_nr is a perfectly reasonable thing to ask about. If you > have a chunk file with zero entries for some entity, then we are > checking that the chunk is the expected zero length as well. Right. >> And even if there are no (unexpected) zero-values involved, shouldn't we >> also check for nonsensical comparisons, such as if chunk_size is smaller >> than record_size? > > Aren't we checking that already? If chunk_size is less than record_size, > then the division will result in "0". If the expected number of records > is not also 0, then that's a bogus file. I was thinking of an early return like if (chunk_size < record_size) return CHUNK_TOO_SMALL before evaluating (chunk_size / pcd->record_size != pcd->record_nr). You're correct that the division will result in "0" if chunk_size is less than record_size. But I didn't like having the extra mental load for reading and understanding the correctness of "if (chunk_size / pcd->record_size != pcd->record_nr)" for that case. IOW, the more early returns we have for known-bad cases, by the time we get to "if (chunk_size / pcd->record_size != pcd->record_nr)" it would be that much easier to understand that code. > What we really care about here is that we won't walk off the end of the > buffer while looking at N fixed-size records ... Ah, I see. This sort of insight would be great to have as a comment in the code. > ... (in that sense, the "too > big" test is overly careful, but it's easy to include as a sanity > check). OK. >> So in summary there appear to be the following possibilities: >> >> CHUNK_MISSING >> CHUNK_TOO_SMALL >> CHUNK_OK >> CHUNK_TOO_BIG_ALIGNED >> CHUNK_TOO_BIG_MISALIGNED > > So yes, I agree all these cases exist. We detect them all except the > "misaligned" case (which I think was discussed earlier in the thread as > not worth caring too much about). OK. > But... > >> (on top of the cases where record_* inputs are zero). >> >> And it seems prudent to treat each of the not-OK cases differently >> (including how we report error messages) once we know which category we >> fall into. > > I'm not sure it is worth caring too much about the different cases. From > the caller's perspective the end result is always to avoid using the > chunk/file. Ah OK. Then yes, it does seem like caring about the different cases is too much from the callers' perspective. But I do think that checking the different cases with early returns will at least help readability (and as a bonus assure future Git developers that divide-by-zero errors are impossible).