Re: [PATCH 02/30] read-cache: add index.computeHash config option


 



On Mon, Nov 07 2022, Derrick Stolee via GitGitGadget wrote:

> Summary
>   'without hash' ran
>     1.78 ± 0.76 times faster than 'with hash'
>
> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

The ± 0.76 part of that is probably just fs caches etc. adding noise. I
tried it on a ramdisk with CFLAGS=-O3:
	
	$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' -w 1 -r 10
	Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      13.3 ms ±   0.3 ms    [User: 7.1 ms, System: 6.1 ms]
	  Range (min … max):    12.7 ms …  13.6 ms    10 runs
	 
	Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      34.8 ms ±   0.4 ms    [User: 28.9 ms, System: 5.8 ms]
	  Range (min … max):    34.2 ms …  35.1 ms    10 runs
	 
	Summary
	  './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
	    2.62 ± 0.07 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'

I also see that if I compile with OPENSSL_SHA1=Y, then:
	
	$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' 
	Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      14.0 ms ±   1.3 ms    [User: 7.7 ms, System: 6.2 ms]
	  Range (min … max):    13.1 ms …  21.7 ms    206 runs
	 
	  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' 
	options.
	 
	Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      21.0 ms ±   1.0 ms    [User: 15.0 ms, System: 6.0 ms]
	  Range (min … max):    20.1 ms …  28.4 ms    138 runs
	 
	  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' 
	options.
	 
	Summary
	  './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
	    1.50 ± 0.15 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'

Which, FWIW, is something worth considering. I.e. when we introduced
sha1dc we did so with the "big hammer" of the existing hashing API,
which is all-or-nothing: we pick the hash implementation when we
compile git.

But that left a lot of things slower for no good reason, e.g. this
hashing of the index trailer. If we could compile in two
implementations and give users the choice of "use the faster hash when
you're not communicating with other git repos", we could make things
faster in some cases without the potential format interop issues.
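To sketch what I mean (all names here are made up for illustration,
this is not git's actual hash API): compile both backends in, dispatch
through a table of function pointers, and pick per call-site:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch, not git's actual API: two hash backends
 * compiled in at once, picked at runtime per call-site instead
 * of one implementation chosen at compile time.
 */
struct hash_impl {
	const char *name;
	void (*hash_buf)(const void *buf, size_t len, unsigned char *out);
};

/*
 * Stand-in for a real digest: both "backends" use the same toy
 * XOR so the sketch stays self-contained; only the dispatch
 * mechanism is the point here.
 */
static void toy_hash(const void *buf, size_t len, unsigned char *out)
{
	const unsigned char *p = buf;
	unsigned char acc = 0;
	size_t i;

	for (i = 0; i < len; i++)
		acc ^= p[i];
	memset(out, acc, 20);
}

static const struct hash_impl impls[] = {
	{ "safe", toy_hash },	/* think sha1dc: collision-detecting */
	{ "fast", toy_hash },	/* think OpenSSL/accelerated SHA-1 */
};

/*
 * Local-only data (like the index trailer) never crosses a repo
 * boundary, so it could take the fast backend; object content
 * that other repos will see keeps the safe one.
 */
static const struct hash_impl *pick_impl(int local_only)
{
	return &impls[local_only ? 1 : 0];
}
```

The interesting part is only pick_impl(): a trailer-hashing path would
pass local_only=1, object hashing would pass 0.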

> From: Derrick Stolee <derrickstolee@xxxxxxxxxx>
> [...]
> +index.computeHash::
> +	When enabled, compute the hash of the index file as it is written
> +	and store the hash at the end of the content. This is enabled by
> +	default.
> ++

If we have a boolean option it makes sense to make its name reflect the
opt-in nature. So "index.skipHash". Then just say "If enabled", and skip
the "this is enabled by default" part. Then this later code:

> +	int compute_hash;
> [...]
> +	if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
> +	    !compute_hash)
> +		f->skip_hash = 1;

Can just become:

	git_config_get_maybe_bool("index.skipHash", &f->skip_hash);

I.e. git_config_get_maybe_bool() leaves the passed-in dest value alone
if the key isn't in the config, and you only use this "compute_hash" as
an inverted version of "skip_hash".

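The contract being relied on is roughly this (a toy mimic of the
semantics, not git's real config code):

```c
#include <assert.h>
#include <string.h>

/*
 * Toy mimic of git_config_get_maybe_bool()'s contract, not the
 * real implementation: return non-zero and leave *dest untouched
 * when the key is absent; return 0 and store the value when set.
 */
static int toy_get_maybe_bool(const char *config_val, int *dest)
{
	if (!config_val)
		return 1; /* not in config: *dest stays as-is */
	*dest = !strcmp(config_val, "true");
	return 0;
}
```

So with skip_hash starting out as 0 (the hashfile default), the
one-liner leaves it 0 unless the user actually set index.skipHash.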
> +If you disable `index.computHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.
> diff --git a/read-cache.c b/read-cache.c
> index 32024029274..f24d96de4d3 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	git_hash_ctx c;
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  	int hdr_version;
> +	int all_zeroes = 1;
> +	unsigned char *start, *end;
>  
>  	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
>  		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	if (!verify_index_checksum)
>  		return 0;
>  
> +	end = (unsigned char *)hdr + size;
> +	start = end - the_hash_algo->rawsz;
> +	while (start < end) {
> +		if (*start != 0) {
> +			all_zeroes = 0;
> +			break;
> +		}
> +		start++;
> +	}

Didn't you just re-invent oidread()? :)
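I.e. the loop spells out by hand what a null-oid check already does; a
self-contained illustration (rawsz hard-coded to the SHA-1 size here,
where git would use the_hash_algo->rawsz):

```c
#include <assert.h>
#include <string.h>

#define RAWSZ 20 /* SHA-1 rawsz, hard-coded for this sketch */

/*
 * What is_null_oid() boils down to: are the trailing RAWSZ bytes
 * all zero? memcmp() against a zeroed buffer replaces the
 * hand-rolled byte loop.
 */
static int trailer_is_null(const unsigned char *end)
{
	static const unsigned char null_hash[RAWSZ]; /* zero-initialized */

	return !memcmp(end - RAWSZ, null_hash, RAWSZ);
}
```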

Just to narrate my way through this. Before we called verify_hdr() we
did:

        hdr = (const struct cache_header *)mmap;
        if (verify_hdr(hdr, mmap_size) < 0)

So, we mmap()'d the index on disk, and "hdr" is the struct version of
that data; here we cast it back to an "unsigned char *", because we're
interested in just the raw bytes.

Then we "jump to the end" and iterate over the trailing rawsz bytes,
just checking whether we have a null_oid().

Then, right after that verify_hdr() call, the very next thing we'll do is:

	oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);

So, maybe I'm missing some subtlety still, and some of this is existing
baggage in the pre-image (we used to have the sha1 in the struct, a
*long* time ago).

But isn't this equivalent?:
	
	diff --git a/read-cache.c b/read-cache.c
	index f24d96de4d3..39b5b8419f5 100644
	--- a/read-cache.c
	+++ b/read-cache.c
	@@ -1812,13 +1812,14 @@ int verify_index_checksum;
	 /* Allow fsck to force verification of the cache entry order. */
	 int verify_ce_order;
	 
	-static int verify_hdr(const struct cache_header *hdr, unsigned long size)
	+static int verify_hdr(const char *const mmap, const size_t size,
	+		      const struct cache_header **hdrp, struct object_id *oid)
	 {
	+	const struct cache_header *hdr = (const struct cache_header *)mmap;
	 	git_hash_ctx c;
	 	unsigned char hash[GIT_MAX_RAWSZ];
	 	int hdr_version;
	-	int all_zeroes = 1;
	-	unsigned char *start, *end;
	+	const unsigned char *end = (unsigned char *)mmap + size;
	 
	 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
	 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
	@@ -1826,20 +1827,12 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
	 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
	 		return error(_("bad index version %d"), hdr_version);
	 
	+	*hdrp = hdr;
	+	oidread(oid, end - the_hash_algo->rawsz);
	+
	 	if (!verify_index_checksum)
	 		return 0;
	-
	-	end = (unsigned char *)hdr + size;
	-	start = end - the_hash_algo->rawsz;
	-	while (start < end) {
	-		if (*start != 0) {
	-			all_zeroes = 0;
	-			break;
	-		}
	-		start++;
	-	}
	-
	-	if (all_zeroes)
	+	if (is_null_oid(oid))
	 		return 0;
	 
	 	the_hash_algo->init_fn(&c);
	@@ -2358,11 +2351,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
	 			mmap_os_err());
	 	close(fd);
	 
	-	hdr = (const struct cache_header *)mmap;
	-	if (verify_hdr(hdr, mmap_size) < 0)
	+	if (verify_hdr(mmap, mmap_size, &hdr, &istate->oid) < 0)
	 		goto unmap;
	-
	-	oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);
	 	istate->version = ntohl(hdr->hdr_version);
	 	istate->cache_nr = ntohl(hdr->hdr_entries);
	 	istate->cache_alloc = alloc_nr(istate->cache_nr);

I.e. we just make the verify function be in charge of populating our
"oid", which we can do early, since we'd error out later in the
function anyway if it doesn't match.

We could avoid the "hdrp" there, but if we're doing the cast it's
probably good for readability to just do it once.

> +test_expect_success 'index.computeHash config option' '
> +	(
> +		rm -f .git/index &&
> +		git -c index.computeHash=false add a &&
> +		git fsck
> +	)
> +'

You can skip the subshell here, but for a non-RFC let's leave the test
in a nice state for the next test someone adds, so maybe:

	test_when_finished "rm -rf repo" &&
	git clone . repo &&
	[...]

Lastly, on this again:

> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

Isn't an unstated major caveat here that it's not just "an older
version": if you set the config to "true" on *your* version your index
doesn't have a hash, and that persists until you rewrite the index?




