On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@xxxxxxxxxx>
> [...]
> However, hashing the file contents during write comes at a performance
> penalty. It's slower to hash the bytes on their way to the disk than
> without that step. This problem is made worse by the replacement of
> hardware-accelerated SHA1 computations with the software-based sha1dc
> computation.

More on that lack of HW accel later...

> This write cost is significant

Don't you mean hashing cost, or do we also do additional writes if we
do the hashing?

> , and the checksum capability is likely
> not worth that cost for such a short-lived file. The index is rewritten
> frequently and the only time the checksum is checked is during 'git
> fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
> computation.

I didn't know that, and had assumed that we at least checked it on the
full read (and I found this bit of the commit message after writing the
last paragraphs here at the end, so maybe skipping this is fine...).

> [...]
> @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
> 	int fd;
> 
> 	hashflush(f);
> -	the_hash_algo->final_fn(f->buffer, &f->ctx);
> +
> +	if (f->skip_hash)
> +		memset(f->buffer, 0, the_hash_algo->rawsz);

Here you're hardcoding a new version of null_oid(), but we can use it
instead.
Perhaps:

diff --git a/csum-file.c b/csum-file.c
index 3243473c3d7..b54c4f66cbb 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 			enum fsync_component component, unsigned int flags)
 {
 	int fd;
+	const struct object_id *const noid = null_oid();
 
 	hashflush(f);
 
 	if (f->skip_hash)
-		memset(f->buffer, 0, the_hash_algo->rawsz);
+		memcpy(f->buffer, noid, sizeof(*noid));
 	else
 		the_hash_algo->final_fn(f->buffer, &f->ctx);

> @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
> 	f->tp = tp;
> 	f->name = name;
> 	f->do_crc = 0;
> +	f->skip_hash = 0;
> 	the_hash_algo->init_fn(&f->ctx);
> 
> 	f->buffer_len = buffer_len;

I think I pointed out in the RFC that we'd be much faster with a
non-sha1collisiondetection SHA-1 implementation, and that maybe that
would get us partway to the performance you desired (or maybe we'd
decide that was a more acceptable trade-off, as it doesn't make the
format backwards-incompatible).

But just from seeing "do_crc" here in the context: did you benchmark
against that? How does it perform? There's no place to put a crc32 in
the index, but we *could* encode it in the hash field, just with a lot
of leading zeros. Maybe that would give us some/most of the performance
benefit, while keeping a checksum?

Or maybe not, but I think it's worth exploring & supporting a different
& faster SHA-1 implementation before making (even opt-in)
backwards-incompatible format changes for performance reasons. If even
that's too slow, maybe crc32 would be sufficient (not compatible, but
still safer than no checksum)?