Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()

Taylor Blau <me@xxxxxxxxxxxx> · Fri, 17 Jan 2025 16:30:05 -0500

On Fri, Jan 10, 2025 at 04:50:25PM -0500, Taylor Blau wrote:
> On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> > So in the new constructor:
> >
> > > +void hashfile_checkpoint_init(struct hashfile *f,
> > > +			      struct hashfile_checkpoint *checkpoint)
> > > +{
> > > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > > +	f->algop->init_fn(&checkpoint->ctx);
> > > +}
> >
> > ...should we actually record "f" itself? And then in the existing
> > functions:
> >
> > >  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
> >
> > ...they'd no longer need to take the extra parameter.
> >
> > It creates a lifetime dependency of the checkpoint struct on the "f" it
> > is checkpointing, but I think that is naturally modeling the domain.
>
> Thanks, I really like these suggestions. I adjusted the series
> accordingly to do this cleanup in two patches (one for
> hashfile_checkpoint(), another for hashfile_truncate()) after the patch
> introducing hashfile_checkpoint_init().

Hmm. I'm not sure that I like this as much as I thought I did.

I agree with you that ultimately the hashfile_checkpoint is (or should
be) tied to the lifetime of the hashfile that it is checkpointing
underneath. But in practice things are a little funky.

Let's suppose I did something like the following:

--- 8< ---

diff --git a/csum-file.c b/csum-file.c
index ebffc80ef7..47b8317a1f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,15 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }

+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+
+	checkpoint->f = f;
+	checkpoint->f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a..8016509c71 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -34,8 +34,10 @@ struct hashfile {
 struct hashfile_checkpoint {
 	off_t offset;
 	git_hash_ctx ctx;
+	struct hashfile *f;
 };

+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
--- >8 ---

, where I'm eliding the trivial changes necessary to make this work at
the two callers. Let's look a little closer at the bulk-checkin caller.
If I do this on top:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..892176d23d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);

+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;

 	while (1) {
--- >8 ---

then test 14 in t1050-large.sh fails because of a segfault in 'git add'.
If we compile with SANITIZE=address, we can see that there's a
use-after-free in hashflush(), which is called by hashfile_checkpoint().

That is a result of the max pack-size code. So we could try something
like:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 7b8a6eb2df..9dc114d132 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint;
+	struct hashfile_checkpoint checkpoint = { 0 };
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -274,17 +274,14 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	the_hash_algo->update_fn(&ctx, obuf, header_len);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0) {
+	if ((flags & HASH_WRITE_OBJECT) != 0)
 		CALLOC_ARRAY(idx, 1);
-
-		prepare_to_stream(state, flags);
-		hashfile_checkpoint_init(state->f, &checkpoint);
-	}
-
 	already_hashed_to = 0;

 	while (1) {
 		prepare_to_stream(state, flags);
+		if (checkpoint.f != state->f)
+			hashfile_checkpoint_init(state->f, &checkpoint);
 		if (idx) {
 			hashfile_checkpoint(&checkpoint);
 			idx->offset = state->offset;
--- >8 ---

which would do the trick, but it feels awfully hacky to have the "if
(checkpoint.f != state->f)" check in there, since that feels too
intimately tied to the implementation of the hashfile_checkpoint API for
my comfort.

It would be nice if we could make the checkpoint only declared within
the loop body itself, but we can't because we need to call
hashfile_truncate() outside of the loop.

Anyway, that's all to say that I think that while this is probably
doable in theory, in practice it's kind of a mess, at least currently.
I would rather see if there are other ways to clean up the
deflate_blob_to_pack() function first in a way that made this change
less awkward.

I think the most reasonable course here would be to pursue a minimal
change like the one presented here and then think about further clean up
as a separate step.

Thanks,
Taylor