On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:

> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 433070a3bd..892176d23d 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  	git_hash_ctx ctx;
>  	unsigned char obuf[16384];
>  	unsigned header_len;
> -	struct hashfile_checkpoint checkpoint = {0};
> +	struct hashfile_checkpoint checkpoint;
>  	struct pack_idx_entry *idx = NULL;
>  
>  	seekback = lseek(fd, 0, SEEK_CUR);
> @@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  			   OBJ_BLOB, size);
>  	the_hash_algo->init_fn(&ctx);
>  	the_hash_algo->update_fn(&ctx, obuf, header_len);
> -	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
>  
>  	/* Note: idx is non-NULL when we are writing */
> -	if ((flags & HASH_WRITE_OBJECT) != 0)
> +	if ((flags & HASH_WRITE_OBJECT) != 0) {
>  		CALLOC_ARRAY(idx, 1);
>  
> +		prepare_to_stream(state, flags);
> +		hashfile_checkpoint_init(state->f, &checkpoint);
> +	}
> +
>  	already_hashed_to = 0;
>  
>  	while (1) {

Yeah, that's ugly. We are potentially throwing away the hashfile that
the checkpoint was created for. That makes my instinct to push the
checkpoint down into the loop where we might start a new pack, like
this (and like you suggested below):

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	already_hashed_to = 0;
 
 	while (1) {
+		struct hashfile_checkpoint checkpoint = {0};
 		prepare_to_stream(state, flags);
 		if (idx) {
+			hashfile_checkpoint_init(state->f, &checkpoint);
 			hashfile_checkpoint(state->f, &checkpoint);
 			idx->offset = state->offset;
 			crc32_begin(state->f);

but that doesn't work, because the checkpoint is also used later for
the already_written() check:

	if (already_written(state, result_oid)) {
		hashfile_truncate(state->f, &checkpoint);
		state->offset = checkpoint.offset;
		free(idx);
	} else

That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a
new pack? Then the original checkpoint would now be bogus! It would
record an offset in the original packfile, which makes no sense for
the pack we now have open.

But I think this is OK, because we can only leave the loop when
stream_blob_to_pack() returns success, and we always establish a new
checkpoint before then. So I do think that moving the initialization
of the checkpoint into the loop, but _not_ moving the variable, would
work the same way it does now (i.e., what you suggested below).

But I admit that the way this loop works kind of makes my head spin.
It can really only ever run twice, but that is hard to see: we break
out of the loop only when stream_blob_to_pack() returns success, and
it returns an error only when we would bust the packsize limit (all
other errors cause us to die()) _and_ we are not the only object in
the pack. Since we start a new pack when we loop, that second
condition can never hold on the second iteration; we'll always either
die() or return success.
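To make that concrete, here is roughly how I read the loop's effective
control flow, condensed into a sketch (try_to_stream() and
start_new_pack() are hypothetical stand-ins for the two halves of the
loop body, not real helpers in bulk-checkin.c):

	while (1) {
		/* prepare_to_stream() + checkpoint + stream_blob_to_pack() */
		if (!try_to_stream(state))
			break;	/* success; the only way out of the loop */

		/*
		 * The stream can fail only when we would bust the
		 * packsize limit and we are not the only object in
		 * the pack; every other error die()s. Roll back to
		 * the checkpoint and open a fresh pack...
		 */
		start_new_pack(state);	/* hashfile_truncate() + flush */

		/*
		 * ...in which our blob will be the only object, so the
		 * failure condition cannot recur: the second iteration
		 * must either die() or break out with success.
		 */
	}
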
I do think it would be much easier to read with a single explicit
retry:

	if (checkpoint_and_try_to_stream() < 0) {
		/* we busted the limit; make a new pack and try again */
		hashfile_truncate();
		etc...
		if (checkpoint_and_try_to_stream() < 0)
			BUG("yikes, we should not fail a second time!");
	}

where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.

Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime
a bit more obvious and less scary to refactor. But it does imply to me
that the data dependency introduced by my suggestion is not always as
straight-forward as I thought it would be, and we should probably punt
on it for your series.

> which would do the trick, but it feels awfully hacky to have the "if
> (checkpoint.f != state->f)" check in there, since that feels too
> intimately tied to the implementation of the hashfile_checkpoint API
> for my comfort.

I think you could unconditionally checkpoint at that point; we're
about to do a write, so we want to store the state before the write in
case we need to roll back.

> Anyway, that's all to say that I think that while this is probably
> doable in theory, in practice it's kind of a mess, at least
> currently. I would rather see if there are other ways to clean up the
> deflate_blob_to_pack() function first in a way that made this change
> less awkward.

Yeah, I actually wrote what I wrote above before reading this far down
in your email, but we arrived at the exact same conclusion. ;)
Hopefully what I wrote might give some pointers if somebody wants to
refactor later.

> I think the most reasonable course here would be to pursue a minimal
> change like the one presented here and then think about further clean
> up as a separate step.

Yep. Thanks for looking into it.

-Peff