On Sat, Apr 20 2019, Jeff King wrote:

> On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:
>
>> I have been thinking about this problem, and I suspect that this compute time
>> is actually spent doing SHA1 calculations, is that possible? Some basic back
>> of the envelope math and scripting seems to show that the repo may actually
>> contain about 2TB of data if you add up the size of all the objects in the
>> repo. Some quick research on the net seems to indicate that we might be able
>> to expect something around 500MB/s throughput on computing SHA1s, does that
>> seem reasonable? If I really have 2TB of data, should it then take around
>> 66mins to get the SHA1s for all that data? Could my repo clone time really be
>> dominated by SHA1 math?
>
> That sounds about right, actually. 8GB to 2TB is a compression ratio of
> 250:1. That's bigger than I've seen, but I get 51:1 in the kernel.
>
> Try this (with a recent version of git; your v1.8.2.1 won't have
> --batch-all-objects):
>
>   # count the on-disk size of all objects
>   git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
>   perl -alne '
>     $repo += $F[0];
>     $disk += $F[1];
>     END { print "$repo / $disk = ", $repo/$disk }
>   '
>
> 250:1 isn't inconceivable if you have large blobs which have small
> changes to them (and at 8GB for 8 million objects, you probably do have
> some larger blobs, since the kernel is about 1/8th the size for the same
> number of objects).
>
> So yes, if you really do have to hash 2TB of data, that's going to take
> a while. "openssl speed" on my machine gives per-second speeds of:
>
>   type        16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>   sha1       135340.73k   337086.10k   677821.10k   909513.73k  1007528.62k  1016916.65k
>
> So it's faster on bigger chunks, but yeah, 500-1000MB/s seems like about
> the best you're going to do. And...
>
>> I mention 1.8.2.1 because we have many old machines which need this. However,
>> I also tested this with git v2.18 and it actually is much slower even
>> (~140mins).
>
> I think v2.18 will have the collision-detecting sha1 on by default,
> which is slower. Building with OPENSSL_SHA1 should be the fastest (and
> is what the numbers above are from). Git's internal (but not
> collision-detecting) BLK_SHA1 is somewhere in the middle.
>
>> Any advice on how to speed up cloning this repo, or what to pursue more
>> in my investigation?
>
> If you don't mind losing the collision detection, using openssl's sha1
> might help. The delta resolution should be threaded, too. So in _theory_
> you're using 66 minutes of CPU time, but that should only take 1-2
> minutes on your 56-core machine. I don't know at what point you'd run
> into lock contention, though. The locking there is quite coarse.

There's also my (been meaning to re-roll)
https://public-inbox.org/git/20181113201910.11518-1-avarab@xxxxxxxxx/
series, if *that* part of the SHA-1 checking is part of what's going on
here. It'll help a *tiny* bit, but of course it's part of the "trust
remote" risk management...

> We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter.
> If the time looks to be going to "resolving deltas", then that should
> all be threaded.
>
> If you want to replay the slow part, it should just be index-pack. So
> something like (with $old as a fresh clone of the repo):
>
>   git init --bare new-repo.git
>   cd new-repo.git
>   perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
>   perf report
>
> should show you where the time is going (substitute perf with whatever
> profiling tool you like).
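
If you do replay it like that, it's also a convenient place to try both
of the knobs mentioned above. A rough sketch, assuming you're building
git from source in a git.git checkout, and that the fresh clone has a
single pack-*.pack:

  # a build with the plain (non-collision-detecting) OpenSSL SHA-1,
  # instead of the DC_SHA1 default
  make OPENSSL_SHA1=YesPlease

  # replay just the index-pack step with varying thread counts (the
  # counts here are arbitrary examples), writing the *.idx to a
  # throwaway path, to see where the coarse locking stops scaling
  for n in 1 8 28 56
  do
      time ./git index-pack --threads=$n -o /tmp/replay.idx \
          "$old"/.git/objects/pack/pack-*.pack
  done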

> As far as avoiding that work altogether, there aren't a lot of options.
> Git clients do not trust the server, so the server sends only the raw
> data, and the client is responsible for computing the object ids. The
> only exception is a local filesystem clone, which will blindly copy or
> hardlink the .pack and .idx files from the source.
>
> In theory there could be a protocol extension to let the client say "I
> trust you, please send me the matching .idx that goes with this pack,
> and I'll assume there was no bitrot nor trickery on your part". I
> don't recall anybody ever discussing such a patch in the past, but I
> think Microsoft's VFS for Git project that backs development on Windows
> might do similar trickery under the hood.

I started to write:

  I wonder if there's room for some tacit client/server cooperation
  without such a protocol change, e.g. the server sending over a pack
  constructed in such a way that everything required for a checkout is
  at the beginning of the data. Now we implicitly tend to do it mostly
  the other way around, for delta optimization purposes. That would
  allow a smart client in a hurry to index-pack it as it goes along,
  and as soon as it has enough to check out HEAD, hand control back to
  the user and continue the rest in the background.

But I realized I was just starting to describe something like 'clone
--depth=1' followed by a 'fetch --unshallow' in the background, except
that would work better (if you did "just the tip" naïvely you'd get
'missing object' on e.g. 'git log'; with that ad-hoc hack we'd need to
write out two packs, etc.):

  $ rm -rf /tmp/git; time git clone --depth=1 https://chromium.googlesource.com/chromium/src /tmp/git; time git -C /tmp/git fetch --unshallow
  Cloning into '/tmp/git'...
  remote: Counting objects: 304839, done
  remote: Finding sources: 100% (304839/304839)
  remote: Total 304839 (delta 70483), reused 204837 (delta 70483)
  Receiving objects: 100% (304839/304839), 1.48 GiB | 19.87 MiB/s, done.
  Resolving deltas: 100% (70483/70483), done.
  Checking out files: 100% (302768/302768), done.

  real    2m10.223s
  user    1m2.434s
  sys     0m15.564s

[not waiting for that second bit, but it'll take ages...]

I think just having a clone mode that did that for you might scratch a
lot of people's itch, i.e. "I want full history, but mainly I want a
checkout right away, so background the full clone". But at this point
I'm just starting to describe some shoddy version of
Documentation/technical/partial-clone.txt :). OTOH there's no "narrow
clone and flesh it out right away" option.

On protocol extensions: just having a way to "wget" the corresponding
*.idx file from the server would be great, and would reduce clone times
by a lot. There's the risk of trusting the server, but most people's use
case is going to be pushing right back to the same server, which'll be
doing a full validation.

We could also defer that validation instead of skipping it, e.g. wget
*.{pack,idx} followed by a 'fsck' in the background. I've sometimes
wanted that anyway, i.e. a "fsck --auto" similar to "gc --auto", run
periodically to detect repository bitflips. Or do some "narrow"
validation of such an *.idx file right away, e.g. for all the
trees/blobs required for the current checkout, and background the rest.
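
To make that "grab the *.idx and validate later" part concrete, here's a
rough manual approximation of it. A sketch only: it assumes a server
that still exposes its objects/ directory over the dumb-HTTP layout, the
URL is made up, and setting up the refs from info/refs is left out:

  base=https://git.example.org/repo.git   # hypothetical dumb-HTTP remote
  git init --bare repo.git
  cd repo.git

  # take the server's word for it: copy a pack *and* its ready-made index
  pack=$(curl -fs "$base/objects/info/packs" | awk '/^P/ { print $2; exit }')
  curl -fs -o "objects/pack/$pack" "$base/objects/pack/$pack"
  curl -fs -o "objects/pack/${pack%.pack}.idx" "$base/objects/pack/${pack%.pack}.idx"

  # ...then defer the validation we skipped, instead of dropping it
  git fsck --full &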