Re: [PATCH 00/23] Multi-pack-index (MIDX)

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 07 Jun 2018 16:45:18 +0200

On Thu, Jun 07 2018, Derrick Stolee wrote:

> To test the performance in this situation, I created a
> script that organizes the Linux repository in a similar
> fashion. I split the commit history into 50 parts by
> creating branches on every 10,000 commits of the first-
> parent history. Then, `git rev-list --objects A ^B`
> provides the list of objects reachable from A but not B,
> so I could send that to `git pack-objects` to create
> these "time-based" packfiles. With these 50 packfiles
> (deleting the old one from my fresh clone, and deleting
> all tags as they were no longer on-disk) I could then
> test 'git rev-list --objects HEAD^{tree}' and see:
>
>         Before: 0.17s
>         After:  0.13s
>         % Diff: -23.5%
>
> By adding logic to count hits and misses to bsearch_pack,
> I was able to see that the command above calls that
> method 266,930 times with a hit rate of 33%. The MIDX
> has the same number of calls with a 100% hit rate.

Do you have the script you used for this? It would be very interesting
as something we could stick in t/perf/ to test this use-case in the
future.
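
Going only by the description above, I'd guess the setup looks roughly
like the sketch below. To be clear, this is my reading of the prose, not
your actual script: the helper names (split_points, rev_ranges,
build_time_based_packs) are made up, adding the tip of the history as a
final split point is an assumption, and it skips the steps of deleting
the original pack and the tags.

```python
import subprocess


def split_points(n_commits, step=10_000):
    """Indices (0 = oldest) of the commits that get a branch: every
    `step`-th commit of the first-parent history, plus the tip
    (including the tip is an assumption of this sketch)."""
    if n_commits <= 0:
        return []
    points = list(range(step - 1, n_commits, step))
    if not points or points[-1] != n_commits - 1:
        points.append(n_commits - 1)
    return points


def rev_ranges(tips):
    """Turn an oldest-first list of tips into rev-list arguments:
    [A1], [A2, ^A1], [A3, ^A2], ..."""
    ranges = [[tips[0]]]
    for prev, cur in zip(tips, tips[1:]):
        ranges.append([cur, "^" + prev])
    return ranges


def git(*args, stdin=None):
    """Run a git command in the current repository and return its stdout."""
    result = subprocess.run(["git", *args], input=stdin, check=True,
                            capture_output=True, text=True)
    return result.stdout


def build_time_based_packs(pack_dir=".git/objects/pack", step=10_000):
    """Write one packfile per slice of the first-parent history of HEAD."""
    commits = git("rev-list", "--first-parent", "--reverse", "HEAD").split()
    tips = [commits[i] for i in split_points(len(commits), step)]
    for spec in rev_ranges(tips):
        # Objects reachable from the newer tip but not from the older one.
        objects = git("rev-list", "--objects", *spec)
        # pack-objects reads the object list on stdin and writes
        # <base>-<hash>.pack plus the matching .idx.
        git("pack-objects", f"{pack_dir}/pack", stdin=objects)
```

Calling build_time_based_packs() from the top of a fresh clone would
write the new packs next to the existing one, which would still have to
be removed by hand before measuring anything.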

How do this and the numbers below compare to just a naïve
--max-pack-size=<similar size> on linux.git?

Is it possible for you to tar this test repo up and share it as a
one-off? I've been polishing the core.validateAbbrev series I have, and
it would be interesting to compare some of the (abbrev) numbers.

> Abbreviation Speedups
> ---------------------
>
> To fully disambiguate an abbreviation, we must iterate
> through all packfiles to ensure no collision exists in
> any packfile. This requires O(P log N) time. With the
> MIDX, this is only O(log N) time. Our standard test [2]
> is 'git log --oneline --parents --raw' because it writes
> many abbreviations while also doing a lot of other work
> (walking commits and trees to compute the raw diff).
>
> For a copy of the Linux repository with 50 packfiles
> split by time, we observed the following:
>
>         Before: 100.5 s
>         After:   58.2 s
>         % Diff: -42.1%
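
To check that I'm reading the O(P log N) vs. O(log N) argument right,
here is a toy model of it. It is only an illustration and not how git
implements abbreviation lookup: the function names and the four-digit
"object ids" are invented, and the single sorted list merely stands in
for the MIDX.

```python
from bisect import bisect_left


def prefix_matches(sorted_ids, prefix, limit=2):
    """Binary-search one sorted id list for ids starting with `prefix`.
    Stops after `limit` matches, since two are enough to prove ambiguity."""
    i = bisect_left(sorted_ids, prefix)
    found = []
    while (i < len(sorted_ids) and len(found) < limit
           and sorted_ids[i].startswith(prefix)):
        found.append(sorted_ids[i])
        i += 1
    return found


def unique_per_pack(packs, prefix):
    """Without a combined index: one binary search in every pack,
    because a colliding object could live in any of them."""
    found = set()
    for ids in packs:
        found.update(prefix_matches(ids, prefix))
    return len(found) == 1


def unique_midx(midx, prefix):
    """With a single sorted list over all packs: one binary search."""
    return len(prefix_matches(midx, prefix)) == 1


# Three toy "packs", each sorted, with made-up hex ids.
PACKS = [["1a2b", "9f00"], ["1a7c", "c3d4"], ["5e6f"]]
MIDX = sorted({oid for ids in PACKS for oid in ids})

for abbrev in ("1a", "1a2", "5", "ff"):
    # Both strategies must agree; only the number of searches differs.
    assert unique_per_pack(PACKS, abbrev) == unique_midx(MIDX, abbrev)
```

Here "1a" is ambiguous only because the two matching ids sit in
different packs, which is exactly the case that forces the per-pack
variant to look at every pack.
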
>
>
> Request for Review Attention
> ----------------------------
>
> I tried my best to take the feedback from the commit-graph
> feature and apply it to this feature. I also worked to
> follow the object-store refactoring as best I could. I also have
> some local commits that create a 'verify' subcommand and
> integrate with 'fsck' similar to the commit-graph, but I'll
> leave those for a later series (and review is still underway
> for that part of the commit-graph).
>
> One place where I could use some guidance is related to the
> current state of 'the_hash_algo' patches. The file format
> allows a different "hash version" which then indicates the
> length of the hash. What's the best way to ensure this
> feature doesn't cause extra pain in the hash-agnostic series?
> This will inform how I go back and make the commit-graph
> feature better in this area, too.
>
>
> Thanks,
> -Stolee