On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:

> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 433070a3bd..892176d23d 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  	git_hash_ctx ctx;
>  	unsigned char obuf[16384];
>  	unsigned header_len;
> -	struct hashfile_checkpoint checkpoint = {0};
> +	struct hashfile_checkpoint checkpoint;
>  	struct pack_idx_entry *idx = NULL;
>  
>  	seekback = lseek(fd, 0, SEEK_CUR);
> @@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  			   OBJ_BLOB, size);
>  	the_hash_algo->init_fn(&ctx);
>  	the_hash_algo->update_fn(&ctx, obuf, header_len);
> -	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
>  
>  	/* Note: idx is non-NULL when we are writing */
> -	if ((flags & HASH_WRITE_OBJECT) != 0)
> +	if ((flags & HASH_WRITE_OBJECT) != 0) {
>  		CALLOC_ARRAY(idx, 1);
>  
> +		prepare_to_stream(state, flags);
> +		hashfile_checkpoint_init(state->f, &checkpoint);
> +	}
> +
>  	already_hashed_to = 0;
>  
>  	while (1) {

Yeah, that's ugly. We are potentially throwing away the hashfile that
the checkpoint was created for. That makes my instinct to push the
checkpoint down into the loop where we might start a new pack, like
this (and like you suggested below):

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	already_hashed_to = 0;
 
 	while (1) {
+		struct hashfile_checkpoint checkpoint = {0};
 		prepare_to_stream(state, flags);
 		if (idx) {
+			hashfile_checkpoint_init(state->f, &checkpoint);
 			hashfile_checkpoint(state->f, &checkpoint);
 			idx->offset = state->offset;
 			crc32_begin(state->f);

but that doesn't work, because the checkpoint is also used later for
the already_written() check:

	if (already_written(state, result_oid)) {
		hashfile_truncate(state->f, &checkpoint);
		state->offset = checkpoint.offset;
		free(idx);
	} else

That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a
new pack? Then the original checkpoint would now be bogus! It would
record an offset in the original packfile, which makes no sense for
the pack we now have open.

But I think this is OK, because we can only leave the loop when
stream_blob_to_pack() returns success, and we always establish a new
checkpoint before then. So I do think that moving the initialization
of the checkpoint into the loop, but _not_ moving the variable, would
work the same way it does now (i.e., what you suggested below).

But I admit that the way this loop works kind of makes my head spin.
It can really only ever run twice, but that is hard to see: we break
out of the loop only when stream_blob_to_pack() returns success, and
it returns an error only when we would bust the packsize limit (all
other errors cause us to die()) _and_ we are not the only object in
the pack. Since we start a new pack when we loop, that second
condition can never hold on the second iteration; we'll always either
die() or return success.
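To make that concrete, here is roughly how I read the loop's effective
control flow, condensed into a sketch (try_to_stream() and
start_new_pack() are hypothetical stand-ins for the two halves of the
loop body, not real helpers in bulk-checkin.c):

	while (1) {
		/* prepare_to_stream() + checkpoint + stream_blob_to_pack() */
		if (!try_to_stream(state))
			break;	/* success; the only way out of the loop */

		/*
		 * The stream can fail only when we would bust the
		 * packsize limit and we are not the only object in
		 * the pack; every other error die()s. Roll back to
		 * the checkpoint and open a fresh pack...
		 */
		start_new_pack(state);	/* hashfile_truncate() + flush */

		/*
		 * ...in which our blob will be the only object, so the
		 * failure condition cannot recur: the second iteration
		 * must either die() or break out with success.
		 */
	}
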
I do think it would be much easier to read with a single explicit
retry:

	if (checkpoint_and_try_to_stream() < 0) {
		/* we busted the limit; make a new pack and try again */
		hashfile_truncate();
		etc...
		if (checkpoint_and_try_to_stream() < 0)
			BUG("yikes, we should not fail a second time!");
	}

where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.

Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime
a bit more obvious and less scary to refactor. But it does imply to me
that the data dependency introduced by my suggestion is not always as
straight-forward as I thought it would be, and we should probably punt
on it for your series.

> which would do the trick, but it feels awfully hacky to have the "if
> (checkpoint.f != state->f)" check in there, since that feels too
> intimately tied to the implementation of the hashfile_checkpoint API
> for my comfort.

I think you could unconditionally checkpoint at that point; we're
about to do a write, so we want to store the state before the write in
case we need to roll back.

> Anyway, that's all to say that I think that while this is probably
> doable in theory, in practice it's kind of a mess, at least
> currently. I would rather see if there are other ways to clean up the
> deflate_blob_to_pack() function first in a way that made this change
> less awkward.

Yeah, I actually wrote what I wrote above before reading this far down
in your email, but we arrived at the exact same conclusion. ;)
Hopefully what I wrote might give some pointers if somebody wants to
refactor later.

> I think the most reasonable course here would be to pursue a minimal
> change like the one presented here and then think about further clean
> up as a separate step.

Yep. Thanks for looking into it.

-Peff