Re: [PATCH] hash: reduce size of algo member of object ID

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 04 Oct 2021 10:47:43 +0200

On Mon, Oct 04 2021, Jeff King wrote:

> On Sun, Oct 03, 2021 at 07:51:40AM +0200, René Scharfe wrote:
>
>> cf0983213c (hash: add an algo member to struct object_id, 2021-04-26)
>> introduced the algo member as an int.  This increased the size of struct
>> object_id by 4 bytes (12.5%) and imposed a 4-byte alignment.  Currently
>> we only need to stored the values 0, 1 and 2 in it.  Let's use an
>> unsigned char instead to reduce the size overhead and lift the alignment
>> requirement.
>> 
>> Signed-off-by: René Scharfe <l.s.r@xxxxxx>
>> ---
>> Not sure how to measure the performance impact of this change.  The perf
>> tests mentioned by cf0983213c don't show much of a difference with
>> GIT_PERF_REPEAT_COUNT=10 for me:
>
> Not surprising. If going from nothing to an int didn't have an impact on
> those tests, then going from an int to char isn't likely to, either.
>
>> 0001.1: rev-list --all                           0.11(0.08+0.02)     0.11(0.08+0.02) +0.0%
>> 0001.2: rev-list --all --objects                 3.04(2.98+0.05)     3.04(2.98+0.05) +0.0%
>> 0001.3: rev-list --parents                       0.05(0.04+0.01)     0.05(0.03+0.01) +0.0%
>> 0001.5: rev-list -- dummy                        0.21(0.20+0.01)     0.21(0.19+0.01) +0.0%
>> 0001.6: rev-list --parents -- dummy              0.22(0.20+0.01)     0.22(0.20+0.01) +0.0%
>> 0001.8: rev-list $commit --not --all             0.06(0.05+0.00)     0.06(0.05+0.00) +0.0%
>> 0001.9: rev-list --objects $commit --not --all   0.06(0.05+0.00)     0.06(0.05+0.00) +0.0%
>
> I do think these probably aren't very good tests. They're going to
> mostly be dealing with actual object structs. Your patch changes the
> size of "struct object_id", but the object structs themselves still need
> 4-byte alignment. And they have a bunch of padding in the first place.
>
> If we instrument "git version --build-options" like this:
>
> diff --git a/help.c b/help.c
> index 3c3bdec213..718e32cadf 100644
> --- a/help.c
> +++ b/help.c
> @@ -11,6 +11,8 @@
>  #include "version.h"
>  #include "refs.h"
>  #include "parse-options.h"
> +#include "blob.h"
> +#include "tag.h"
>  
>  struct category_description {
>  	uint32_t category;
> @@ -663,6 +665,11 @@ void get_version_info(struct strbuf *buf, int show_build_options)
>  		strbuf_addf(buf, "sizeof-long: %d\n", (int)sizeof(long));
>  		strbuf_addf(buf, "sizeof-size_t: %d\n", (int)sizeof(size_t));
>  		strbuf_addf(buf, "shell-path: %s\n", SHELL_PATH);
> +		strbuf_addf(buf, "sizeof-object_id: %d\n", (int)sizeof(struct object_id));
> +		strbuf_addf(buf, "sizeof-commit: %d\n", (int)sizeof(struct commit));
> +		strbuf_addf(buf, "sizeof-blob: %d\n", (int)sizeof(struct blob));
> +		strbuf_addf(buf, "sizeof-tree: %d\n", (int)sizeof(struct tree));
> +		strbuf_addf(buf, "sizeof-tag: %d\n", (int)sizeof(struct tag));
>  		/* NEEDSWORK: also save and output GIT-BUILD_OPTIONS? */
>  	}
>  }
>
> then we can see that your patch doesn't the change the size of any of
> the structs at all! Even the original cf0983213c going from nothing to
> an int changed only "struct blob" (it went from 32 to 36 bytes).
>
> A more interesting test is something that stores a bunch of oids. A good
> candidate is "cat-file --batch-all-objects". In the default mode, it
> uses an oid_array to sort the set of objects. In unordered mode, it uses
> an oidset to mark seen ones.
>
> Here are hyperfine results. The three versions are:
>
>   - none: cf0983213c^
>   - int: cf0983213c
>   - char: cf0983213c with the equivalent of your patch on top
>
> All compiled with "gcc -O2", and run in a fully packed clone of
> linux.git.
>
> It looks like adding the "algo" field did make a big difference for the
> oid_array case, but changing it to a char doesn't seem to help at all:
>
>   $ hyperfine -L v none,int,char './git.{v} cat-file --batch-all-objects --batch-check="%(objectname)"'
>   Benchmark #1: ./git.none cat-file --batch-all-objects --batch-check="%(objectname)"
>     Time (mean ± σ):      1.653 s ±  0.009 s    [User: 1.607 s, System: 0.046 s]
>     Range (min … max):    1.640 s …  1.670 s    10 runs
>    
>   Benchmark #2: ./git.int cat-file --batch-all-objects --batch-check="%(objectname)"
>     Time (mean ± σ):      1.067 s ±  0.012 s    [User: 1.017 s, System: 0.050 s]
>     Range (min … max):    1.053 s …  1.089 s    10 runs
>    
>   Benchmark #3: ./git.char cat-file --batch-all-objects --batch-check="%(objectname)"
>     Time (mean ± σ):      1.092 s ±  0.013 s    [User: 1.046 s, System: 0.046 s]
>     Range (min … max):    1.080 s …  1.116 s    10 runs
>    
>   Summary
>     './git.int cat-file --batch-all-objects --batch-check="%(objectname)"' ran
>       1.02 ± 0.02 times faster than './git.char cat-file --batch-all-objects --batch-check="%(objectname)"'
>       1.55 ± 0.02 times faster than './git.none cat-file --batch-all-objects --batch-check="%(objectname)"'

*Rubs eyes*

Maybe I'm being really stupid here, but doesn't this say the opposite of
your summary? I.e. that adding the "int" field in cf0983213c made it
faster. It was ~1.6s before, now it's ~1.1s? I haven't used "hyperfine"
before, but this output is consistent with that:
    
    $ hyperfine --warmup 2 -L v 1.6,1.0 'sleep {v}'
    Benchmark #1: sleep 1.6
      Time (mean ± σ):      1.604 s ±  0.000 s    [User: 2.2 ms, System: 1.1 ms]
      Range (min … max):    1.603 s …  1.604 s    10 runs
     
    Benchmark #2: sleep 1.0
      Time (mean ± σ):      1.004 s ±  0.000 s    [User: 2.5 ms, System: 0.9 ms]
      Range (min … max):    1.004 s …  1.004 s    10 runs
     
    Summary
      'sleep 1.0' ran
        1.60 ± 0.00 times faster than 'sleep 1.6'

[Goes and reads the downthread
<YVq5XCyLDr0SPBzx@xxxxxxxxxxxxxxxxxxxxxxx> where I see you made the same
discovery, should have probably done that first :)]

Anyway, just for experimenting I played with removing "int algo"
entirely on top of master. Diff on top at the end. That gives the same
sorts of results:
    
    $ hyperfine --warmup 2 -L v algo-02,algo-03,no-algo-02,no-algo-03 '~/g/git/git.{v} cat-file --batch-all-objects --batch-check="%(objectname)"'
    Benchmark #1: ~/g/git/git.algo-02 cat-file --batch-all-objects --batch-check="%(objectname)"
      Time (mean ± σ):      1.283 s ±  0.022 s    [User: 1.217 s, System: 0.066 s]
      Range (min … max):    1.265 s …  1.336 s    10 runs
    
    Benchmark #2: ~/g/git/git.algo-03 cat-file --batch-all-objects --batch-check="%(objectname)"
      Time (mean ± σ):      1.261 s ±  0.032 s    [User: 1.195 s, System: 0.066 s]
      Range (min … max):    1.240 s …  1.343 s    10 runs
    
    Benchmark #3: ~/g/git/git.no-algo-02 cat-file --batch-all-objects --batch-check="%(objectname)"
      Time (mean ± σ):      1.911 s ±  0.034 s    [User: 1.853 s, System: 0.058 s]
      Range (min … max):    1.878 s …  1.977 s    10 runs
    
    Benchmark #4: ~/g/git/git.no-algo-03 cat-file --batch-all-objects --batch-check="%(objectname)"
      Time (mean ± σ):      1.880 s ±  0.011 s    [User: 1.824 s, System: 0.056 s]
      Range (min … max):    1.864 s …  1.905 s    10 runs
    
    Summary
      '~/g/git/git.algo-03 cat-file --batch-all-objects --batch-check="%(objectname)"' ran
        1.02 ± 0.03 times faster than '~/g/git/git.algo-02 cat-file --batch-all-objects --batch-check="%(objectname)"'
        1.49 ± 0.04 times faster than '~/g/git/git.no-algo-03 cat-file --batch-all-objects --batch-check="%(objectname)"'
        1.52 ± 0.05 times faster than '~/g/git/git.no-algo-02 cat-file --batch-all-objects --batch-check="%(objectname)"'

Of course I did that thinking it might be worthwhile to provide say a
compile-time option for SHA-1-only deployments of git, but that's when I
went along with your misreading of the hyperfine output, seems it
actually makes it fasterer :)

Aside: I recently started testing git on HP/UX's aCC, which is quite
whiny about things like this:

    "blame.c", line 2677: warning #4232-D: conversion from "struct object *" to a more strictly aligned type "struct commit *" may cause misaligned access
                found = (struct commit *)obj;

Which I initially thought might be changed by something like René's
patch, but is (I think) due to the timestamp_t in commit.h, presumably
it's 32bit on HP/UX still? And int is 64 (I haven't actually been able
to fully complie git yet, so...).

 Just an aside, but also wondering how if anything that & other
alignment might have to do with these results on platforms/compilers
that are less whiny about it. Or maybe it's all aligned on x86_64
(again, didn't check).

> I'm actually surprised it had this much of an impact. But I guess this
> benchmark really is mostly just memcpy-ing oids into a big array,
> sorting it, and printing the result. If that array is 12% bigger, we'd
> expect at least a 12% speedup. But adding in non-linear elements like
> growing the array (though I guess that is amortized linear) and sorting
> (which picks up an extra log(n) term) make the difference.
>
> It's _kind of_ silly in a sense, since usually you'd ask for other parts
> of the object, which will make the speed difference relatively smaller.
> But just dumping a bunch of oids is actually not an unreasonable thing
> to do. I suspect it got a lot slower with 32-byte GIT_MAX_RAWSZ, too
> (even when you're using 20-byte sha1), but I don't think there's an easy
> way to get out of that.

[I wrote the below before seeing that you misread the "hyperfine"
output, but most of it applies still]:

Not easy, but it might not be super-duper-hard either, I think it might
be a worthwhile thing to try, and these results seem to back that up.

I had a mostly-working patch for that that I never submitted and just
hacked up one afternoon. I was experimenting with it for something
different: To have Git support a SHA-160 (completely non-standard, the
lowest is SHA-224), i.e. a SHA-256 truncated to SHA-1 size.

The advantage being that for say a system that has various
outside-of-git (e.g. DB columns) hardcoded to SHA-1 size you could
convert the repo to SHA-256, but use SHA-1-sized hashes publicly. Less
secure than SHA-256 obviously, but only proportionally to the reduction
in bits, and would still be less hairy than a SHA-1 derivative.

I didn't think it was worthwhile & threw it away after playing with it,
but anyway...

Once you've got that, which requires disconnecting the hard assumption
in hash.h and friends about hash type == hash length, i.e. oid_to_hex()
and friends need to have variants that take 160, 224 etc.

You can also do it the other way around. Say have a format like the
commit-graph reference 80 bit objects, but have the object store stored
in the full 256 bits.

That obviously creates all sorts of hairy edge cases where none exist
today. I.e. if you set the storage hash a SHA-1 repo to be 1/4 the size
that it is now you can't push/pull anything, or if you could we'd need
to piggy-back on the planned on-the-fly SHA-1<->SHA-256 rewriting.

But it might be interesting & useful for these performance
reasons. I.e. it's pretty easy (and I had some throwaway version of this
working with only the "occasional" segfault) to get a copy of "git"
running that say stores all OIDs as 1/4 their size.

As long as you know you've got no collisions you can use that as a quick
side-index. All your problems with interoperability are ones we'll need
to deal with for SHA-1<->SHA-256, although you'll have the extra
potential problem of collisions (which is X% likely depending on your
object numbers & size of truncation you choose).

Diff at end:

diff --git a/builtin/show-index.c b/builtin/show-index.c
index 0e0b9fb95bc..b63a3bf60b9 100644
--- a/builtin/show-index.c
+++ b/builtin/show-index.c
@@ -74,7 +74,6 @@ int cmd_show_index(int argc, const char **argv, const char *prefix)
 		for (i = 0; i < nr; i++) {
 			if (fread(entries[i].oid.hash, hashsz, 1, stdin) != 1)
 				die("unable to read sha1 %u/%u", i, nr);
-			entries[i].oid.algo = hash_algo_by_ptr(the_hash_algo);
 		}
 		for (i = 0; i < nr; i++)
 			if (fread(&entries[i].crc, 4, 1, stdin) != 1)
diff --git a/hash.h b/hash.h
index 9e25c40e9ac..bd1855b65ec 100644
--- a/hash.h
+++ b/hash.h
@@ -115,7 +115,6 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s
 
 struct object_id {
 	unsigned char hash[GIT_MAX_RAWSZ];
-	int algo;	/* XXX requires 4-byte alignment */
 };
 
 /* A suitably aligned type for stack allocations of hash contexts. */
@@ -213,12 +212,7 @@ static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 
 static inline int oidcmp(const struct object_id *oid1, const struct object_id *oid2)
 {
-	const struct git_hash_algo *algop;
-	if (!oid1->algo)
-		algop = the_hash_algo;
-	else
-		algop = &hash_algos[oid1->algo];
-	return hashcmp_algop(oid1->hash, oid2->hash, algop);
+	return hashcmp_algop(oid1->hash, oid2->hash, the_hash_algo);
 }
 
 static inline int hasheq_algop(const unsigned char *sha1, const unsigned char *sha2, const struct git_hash_algo *algop)
@@ -239,12 +233,7 @@ static inline int hasheq(const unsigned char *sha1, const unsigned char *sha2)
 
 static inline int oideq(const struct object_id *oid1, const struct object_id *oid2)
 {
-	const struct git_hash_algo *algop;
-	if (!oid1->algo)
-		algop = the_hash_algo;
-	else
-		algop = &hash_algos[oid1->algo];
-	return hasheq_algop(oid1->hash, oid2->hash, algop);
+	return hasheq_algop(oid1->hash, oid2->hash, the_hash_algo);
 }
 
 static inline int is_null_oid(const struct object_id *oid)
@@ -260,23 +249,16 @@ static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
 static inline void oidcpy(struct object_id *dst, const struct object_id *src)
 {
 	memcpy(dst->hash, src->hash, GIT_MAX_RAWSZ);
-	dst->algo = src->algo;
 }
 
 /* Like oidcpy() but zero-pads the unused bytes in dst's hash array. */
 static inline void oidcpy_with_padding(struct object_id *dst,
 				       const struct object_id *src)
 {
-	size_t hashsz;
-
-	if (!src->algo)
-		hashsz = the_hash_algo->rawsz;
-	else
-		hashsz = hash_algos[src->algo].rawsz;
+	size_t hashsz = the_hash_algo->rawsz;
 
 	memcpy(dst->hash, src->hash, hashsz);
 	memset(dst->hash + hashsz, 0, GIT_MAX_RAWSZ - hashsz);
-	dst->algo = src->algo;
 }
 
 static inline struct object_id *oiddup(const struct object_id *src)
@@ -294,13 +276,11 @@ static inline void hashclr(unsigned char *hash)
 static inline void oidclr(struct object_id *oid)
 {
 	memset(oid->hash, 0, GIT_MAX_RAWSZ);
-	oid->algo = hash_algo_by_ptr(the_hash_algo);
 }
 
 static inline void oidread(struct object_id *oid, const unsigned char *hash)
 {
 	memcpy(oid->hash, hash, the_hash_algo->rawsz);
-	oid->algo = hash_algo_by_ptr(the_hash_algo);
 }
 
 static inline int is_empty_blob_sha1(const unsigned char *sha1)
@@ -325,7 +305,7 @@ static inline int is_empty_tree_oid(const struct object_id *oid)
 
 static inline void oid_set_algo(struct object_id *oid, const struct git_hash_algo *algop)
 {
-	oid->algo = hash_algo_by_ptr(algop);
+	return;
 }
 
 const char *empty_tree_oid_hex(void);
diff --git a/hex.c b/hex.c
index 4f64d346963..6538e415a37 100644
--- a/hex.c
+++ b/hex.c
@@ -143,7 +143,7 @@ char *hash_to_hex_algop_r(char *buffer, const unsigned char *hash,
 
 char *oid_to_hex_r(char *buffer, const struct object_id *oid)
 {
-	return hash_to_hex_algop_r(buffer, oid->hash, &hash_algos[oid->algo]);
+	return hash_to_hex_algop_r(buffer, oid->hash, &hash_algos[GIT_HASH_SHA1]);
 }
 
 char *hash_to_hex_algop(const unsigned char *hash, const struct git_hash_algo *algop)
@@ -161,5 +161,5 @@ char *hash_to_hex(const unsigned char *hash)
 
 char *oid_to_hex(const struct object_id *oid)
 {
-	return hash_to_hex_algop(oid->hash, &hash_algos[oid->algo]);
+	return hash_to_hex_algop(oid->hash, &hash_algos[GIT_HASH_SHA1]);
 }
diff --git a/http-push.c b/http-push.c
index 3309aaf004a..3ce453e14a4 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1011,8 +1011,6 @@ static void remote_ls(const char *path, int flags,
 /* extract hex from sharded "xx/x{38}" filename */
 static int get_oid_hex_from_objpath(const char *path, struct object_id *oid)
 {
-	oid->algo = hash_algo_by_ptr(the_hash_algo);
-
 	if (strlen(path) != the_hash_algo->hexsz + 1)
 		return -1;
 
diff --git a/object-file.c b/object-file.c
index be4f94ecf3b..be1385c5e72 100644
--- a/object-file.c
+++ b/object-file.c
@@ -58,27 +58,21 @@
 
 static const struct object_id empty_tree_oid = {
 	.hash = EMPTY_TREE_SHA1_BIN_LITERAL,
-	.algo = GIT_HASH_SHA1,
 };
 static const struct object_id empty_blob_oid = {
 	.hash = EMPTY_BLOB_SHA1_BIN_LITERAL,
-	.algo = GIT_HASH_SHA1,
 };
 static const struct object_id null_oid_sha1 = {
 	.hash = {0},
-	.algo = GIT_HASH_SHA1,
 };
 static const struct object_id empty_tree_oid_sha256 = {
 	.hash = EMPTY_TREE_SHA256_BIN_LITERAL,
-	.algo = GIT_HASH_SHA256,
 };
 static const struct object_id empty_blob_oid_sha256 = {
 	.hash = EMPTY_BLOB_SHA256_BIN_LITERAL,
-	.algo = GIT_HASH_SHA256,
 };
 static const struct object_id null_oid_sha256 = {
 	.hash = {0},
-	.algo = GIT_HASH_SHA256,
 };
 
 static void git_hash_sha1_init(git_hash_ctx *ctx)
@@ -105,7 +99,6 @@ static void git_hash_sha1_final_oid(struct object_id *oid, git_hash_ctx *ctx)
 {
 	git_SHA1_Final(oid->hash, &ctx->sha1);
 	memset(oid->hash + GIT_SHA1_RAWSZ, 0, GIT_MAX_RAWSZ - GIT_SHA1_RAWSZ);
-	oid->algo = GIT_HASH_SHA1;
 }
 
 
@@ -137,7 +130,6 @@ static void git_hash_sha256_final_oid(struct object_id *oid, git_hash_ctx *ctx)
 	 * but keep it in case we extend the hash size again.
 	 */
 	memset(oid->hash + GIT_SHA256_RAWSZ, 0, GIT_MAX_RAWSZ - GIT_SHA256_RAWSZ);
-	oid->algo = GIT_HASH_SHA256;
 }
 
 static void git_hash_unknown_init(git_hash_ctx *ctx)
diff --git a/oidtree.c b/oidtree.c
index 0d39389bee2..61f4d9515b5 100644
--- a/oidtree.c
+++ b/oidtree.c
@@ -33,9 +33,6 @@ void oidtree_insert(struct oidtree *ot, const struct object_id *oid)
 	struct cb_node *on;
 	struct object_id k;
 
-	if (!oid->algo)
-		BUG("oidtree_insert requires oid->algo");
-
 	on = mem_pool_alloc(&ot->mem_pool, sizeof(*on) + sizeof(*oid));
 
 	/*
@@ -62,13 +59,6 @@ int oidtree_contains(struct oidtree *ot, const struct object_id *oid)
 
 	oidcpy_with_padding(&k, oid);
 
-	if (oid->algo == GIT_HASH_UNKNOWN)
-		klen -= sizeof(oid->algo);
-
-	/* cb_lookup relies on memcmp on the struct, so order matters: */
-	klen += BUILD_ASSERT_OR_ZERO(offsetof(struct object_id, hash) <
-				offsetof(struct object_id, algo));
-
 	return cb_lookup(&ot->tree, (const uint8_t *)&k, klen) ? 1 : 0;
 }
 
@@ -80,9 +70,6 @@ static enum cb_next iter(struct cb_node *n, void *arg)
 	/* Copy to provide 4-byte alignment needed by struct object_id. */
 	memcpy(&k, n->k, sizeof(k));
 
-	if (x->algo != GIT_HASH_UNKNOWN && x->algo != k.algo)
-		return CB_CONTINUE;
-
 	if (x->last_nibble_at) {
 		if ((k.hash[*x->last_nibble_at] ^ x->last_byte) & 0xf0)
 			return CB_CONTINUE;
@@ -100,7 +87,6 @@ void oidtree_each(struct oidtree *ot, const struct object_id *oid,
 
 	x.fn = fn;
 	x.arg = arg;
-	x.algo = oid->algo;
 	if (oidhexsz & 1) {
 		x.last_byte = oid->hash[klen];
 		x.last_nibble_at = &klen;
diff --git a/t/helper/test-oidtree.c b/t/helper/test-oidtree.c
index 180ee28dd93..6e22b422ddc 100644
--- a/t/helper/test-oidtree.c
+++ b/t/helper/test-oidtree.c
@@ -13,7 +13,7 @@ int cmd__oidtree(int argc, const char **argv)
 	struct oidtree ot;
 	struct strbuf line = STRBUF_INIT;
 	int nongit_ok;
-	int algo = GIT_HASH_UNKNOWN;
+	int algo = GIT_HASH_SHA1;
 
 	oidtree_init(&ot);
 	setup_git_directory_gently(&nongit_ok);
@@ -25,7 +25,6 @@ int cmd__oidtree(int argc, const char **argv)
 		if (skip_prefix(line.buf, "insert ", &arg)) {
 			if (get_oid_hex_any(arg, &oid) == GIT_HASH_UNKNOWN)
 				die("insert not a hexadecimal oid: %s", arg);
-			algo = oid.algo;
 			oidtree_insert(&ot, &oid);
 		} else if (skip_prefix(line.buf, "contains ", &arg)) {
 			if (get_oid_hex(arg, &oid))
@@ -37,7 +36,6 @@ int cmd__oidtree(int argc, const char **argv)
 			memcpy(buf, arg, strlen(arg));
 			buf[hash_algos[algo].hexsz] = '\0';
 			get_oid_hex_any(buf, &oid);
-			oid.algo = algo;
 			oidtree_each(&ot, &oid, strlen(arg), print_oid, NULL);
 		} else if (!strcmp(line.buf, "clear")) {
 			oidtree_clear(&ot);