Re: [PATCH v3 3/9] finalize_object_file(): implement collision check

Taylor Blau <me@xxxxxxxxxxxx> · Mon, 16 Sep 2024 11:54:14 -0400

On Mon, Sep 16, 2024 at 12:45:38PM +0200, Patrick Steinhardt wrote:
> This function compares the exact contents, but isn't that wrong? The
> contents may differ even though the object is the same because the
> object hash is derived from the uncompressed data, whereas we store
> compressed data on disk.
>
> So this might trigger when you have different zlib versions, but also if
> you configure core.compression differently.

Oh, shoot -- you're right for when we call finalize_object_file() on
loose objects.

For packfiles and other spots where we use the hashfile API, the
name of the file depends on the checksum'd contents, so we're safe
there. But for loose objects the, the name of the file is based on the
object hash, which is the hash of the uncompressed contents.

So I think we would need something like this on top:

--- 8< ---

diff --git a/object-file.c b/object-file.c
index 84dd7d0fab..2f1616651e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1992,10 +1992,8 @@ static int check_collision(const char *filename_a, const char *filename_b)
 	return ret;
 }

-/*
- * Move the just written object into its final resting place.
- */
-int finalize_object_file(const char *tmpfile, const char *filename)
+static int finalize_object_file_1(const char *tmpfile, const char *filename,
+				  unsigned check_collisions)
 {
 	struct stat st;
 	int ret = 0;
@@ -2044,6 +2042,14 @@ int finalize_object_file(const char *tmpfile, const char *filename)
 	return 0;
 }

+/*
+ * Move the just written object into its final resting place.
+ */
+int finalize_object_file(const char *tmpfile, const char *filename)
+{
+	return finalize_object_file_1(tmpfile, filename, 1);
+}
+
 static void hash_object_file_literally(const struct git_hash_algo *algo,
 				       const void *buf, unsigned long len,
 				       const char *type, struct object_id *oid)
@@ -2288,7 +2294,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 			warning_errno(_("failed utime() on %s"), tmp_file.buf);
 	}

-	return finalize_object_file(tmp_file.buf, filename.buf);
+	return finalize_object_file_1(tmp_file.buf, filename.buf, 0);
 }

 static int freshen_loose_object(const struct object_id *oid)
@@ -2410,7 +2416,7 @@ int stream_loose_object(struct input_stream *in_stream, size_t len,
 		strbuf_release(&dir);
 	}

-	err = finalize_object_file(tmp_file.buf, filename.buf);
+	err = finalize_object_file_1(tmp_file.buf, filename.buf, 0);
 	if (!err && compat)
 		err = repo_add_loose_object_map(the_repository, oid, &compat_oid);
 cleanup:
--- >8 ---

But I'd like for some others to take a look at this before I send a new
round that includes this change.

Thanks for catching that!

Thanks,
Taylor