Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sun, 07 Jan 2018 23:42:27 +0100

On Sun, Jan 07 2018, Derrick Stolee jotted:

>     git log --oneline --raw --parents
>
> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
> ----------+-------------+------------+--------+----------
>         1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
>        24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
>       127 |    257.97 s |    42.25 s | -83.6% |  +18.6%
>
> The last column is the relative difference between the MIDX-enabled repo
> and the single-pack repo. The goal of the MIDX feature is to present the
> ODB as if it was fully repacked, so there is still room for improvement.
>
> Changing the command to
>
>     git log --oneline --raw --parents --abbrev=40
>
> has no observable difference (sub 1% change in all cases). This is likely
> due to the repack I used putting commits and trees in a small number of
> packfiles so the MRU cache workes very well. On more naturally-created
> lists of packfiles, there can be up to 20% improvement on this command.
>
> We are using a version of this patch with an upcoming release of GVFS.
> This feature is particularly important in that space since GVFS performs
> a "prefetch" step that downloads a pack of commits and trees on a daily
> basis. These packfiles are placed in an alternate that is shared by all
> enlistments. Some users have 150+ packfiles and the MRU misses and
> abbreviation computations are significant. Now, GVFS manages the MIDX file
> after adding new prefetch packfiles using the following command:
>
>     git midx --write --update-head --delete-expired --pack-dir=<alt>

(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.

The reason we do that at all is because it makes for a prettier UI.

Are there things that both want the pretty SHA-1 and also care about the
throughput? I'd have expected machine parsing to just use
--no-abbrev-commit.

If something cares about both throughput and e.g. is saving the
abbreviated SHA-1s isn't it better off picking some arbitrary size
(e.g. --abbrev=20), after all the default abbreviation is going to show
something as small as possible, which may soon become ambigous after the
next commit.