On Tue, Nov 05, 2024 at 03:05:07AM +0000, Derrick Stolee via GitGitGadget wrote: > Test this tree > ----------------------------------------------------------------- > 5314.1: paths at head 4.5K > 5314.2: number of distinct name-hashes 4.1K > 5314.3: number of distinct full-name-hashes 4.5K > 5314.4: maximum multiplicity of name-hashes 13 > 5314.5: maximum multiplicity of fullname-hashes 1 > > Here, the maximum collision multiplicity is 13, but around 10% of paths > have a collision with another path. Neat. > diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c > new file mode 100644 > index 00000000000..e4ecd159b76 > --- /dev/null > +++ b/t/helper/test-name-hash.c > @@ -0,0 +1,24 @@ > +/* > + * test-name-hash.c: Read a list of paths over stdin and report on their > + * name-hash and full name-hash. > + */ > + > +#include "test-tool.h" > +#include "git-compat-util.h" > +#include "pack-objects.h" > +#include "strbuf.h" > + > +int cmd__name_hash(int argc UNUSED, const char **argv UNUSED) > +{ > + struct strbuf line = STRBUF_INIT; > + > + while (!strbuf_getline(&line, stdin)) { > + uint32_t name_hash = pack_name_hash(line.buf); > + uint32_t full_hash = pack_full_name_hash(line.buf); > + > + printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf); I'm definitely nitpicking, but having a tab to separate these two 32-bit values feels odd when we know already that they will be at most 10-characters wide. I probably would have written: printf("%10"PRIu32" %10"PRIu32"\t%s\n", name_hash, full_hash, line.buf); instead, but this is obviously not a big deal either way ;-). > diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh > new file mode 100755 > index 00000000000..9fe26612fac > --- /dev/null > +++ b/t/perf/p5314-name-hash.sh > @@ -0,0 +1,41 @@ > +#!/bin/sh > + > +test_description='Tests pack performance using bitmaps' > +. ./perf-lib.sh > + > +GIT_TEST_PASSING_SANITIZE_LEAK=0 > +export GIT_TEST_PASSING_SANITIZE_LEAK Does this conflict with Patrick's series to remove these leak checking annotations? I think it might, which is not unexpected given this series was written before that one (and it's my fault for not reviewing it earlier). > +test_perf_large_repo > + > +test_size 'paths at head' ' > + git ls-tree -r --name-only HEAD >path-list && > + wc -l <path-list > +' > + > +test_size 'number of distinct name-hashes' ' > + cat path-list | test-tool name-hash >name-hashes && > + cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count && In these two (and a handful of others lower down in this same script) the "cat ... |" is unnecessary. I think this one should be written as: test-tool name-hash <path-list >name-hashes && awk "{ print \$1; }" <name-hashes | sort | uniq -c >name-hash-count && (sort -n is unnecessary, since we just care about getting the list in sorted order so that "uniq -c" can count the number of unique values). > + wc -l <name-hash-count > +' > + > +test_size 'number of distinct full-name-hashes' ' > + cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count && > + wc -l <full-name-hash-count > +' > + > +test_size 'maximum multiplicity of name-hashes' ' > + cat name-hash-count | \ > + sort -nr | \ > + head -n 1 | \ > + awk "{ print \$1; }" > +' > + > +test_size 'maximum multiplicity of fullname-hashes' ' > + cat full-name-hash-count | \ > + sort -nr | \ > + head -n 1 | \ > + awk "{ print \$1; }" Nitpicking again, but you could extract the "sort | head | awk" pipeline into a function. Thanks, Taylor