Re: Resolving deltas dominates clone time

On Mon, Apr 22 2019, Jeff King wrote:

> On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > If you don't mind losing the collision-detection, using openssl's sha1
>> > might help. The delta resolution should be threaded, too. So in _theory_
>> > you're using 66 minutes of CPU time, but that should only take 1-2
>> > minutes on your 56-core machine. I don't know at what point you'd run
>> > into lock contention, though. The locking there is quite coarse.
>>
>> There's also my (been meaning to re-roll)
>> https://public-inbox.org/git/20181113201910.11518-1-avarab@xxxxxxxxx/
>> *that* part of the SHA-1 checking is part of what's going on here. It'll
>> help a *tiny* bit, but of course is part of the "trust remote" risk
>> management...
>
> I think we're talking about two different collision detections, and your
> patch wouldn't help at all here.
>
> Your patch is optionally removing the "woah, we got an object with a
> duplicate sha1, let's check that the bytes are the same in both copies"
> check. But Martin's problem is a clone, so we wouldn't have any existing
> objects to duplicate in the first place.

Right, but we do hit that check anyway, as reported by Geert at
@amazon.com in the thread preceding that patch of mine. But it's 99.99%
irrelevant to *performance* in this case after the loose object cache
you added (before that cache it could make all the difference, depending
on the FS).

I just mentioned it to plant a flag on another bit of the code where
index-pack in general has certain paranoias/validations the user might
be willing to optionally drop at "clone" time.

> The problem in his case is literally just that the actual SHA-1 is
> expensive, and that can be helped by using the optimized openssl
> implementation rather than the sha1dc (which checks not collisions with
> objects we _have_, but evidence of somebody trying to exploit weaknesses
> in sha1).
>
> One thing we could do to make that easier is a run-time flag to switch
> between sha1dc and a faster implementation (either openssl or blk_sha1,
> depending on the build). That would let you flip the "trust" bit per
> operation, rather than having it baked into your build.

Yeah, this would be neat.
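
Today that trust/speed trade-off can only be made at build time via
git's Makefile knobs; the run-time switch you describe doesn't exist
yet, so the `core.sha1Implementation` name below is made up purely for
illustration:

```shell
# Build-time selection (real Makefile knobs in git.git):
make OPENSSL_SHA1=YesPlease   # fast OpenSSL SHA-1, no collision detection
make DC_SHA1=YesPlease        # sha1dc with collision detection (the default)

# A hypothetical per-operation switch might then look something like:
git -c core.sha1Implementation=openssl clone https://example.com/huge.git
```

I.e. you'd keep the sha1dc build as the default, and flip to the fast
implementation only for operations where you've decided to trust the
remote.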

> (Note that the oft-discussed "use a faster sha1 implementation for
> checksums, but sha1dc for object hashing" idea would not help here,
> because these really are object hashes whose time is dominating. We have
> to checksum 8GB of raw packfile but 2TB of object data).
>
>> I started to write:
>>
>>     I wonder if there's room for some tacit client/server cooperation
>>     without such a protocol change.
>>
>>     E.g. the server sending over a pack constructed in such a way that
>>     everything required for a checkout is at the beginning of the
>>     data. Now we implicitly tend to do it mostly the other way around
>>     for delta optimization purposes.
>>
>>     That would allow a smart client in a hurry to index-pack it as they
>>     go along, and as soon as they have enough to check out HEAD return
>>     to the client, and continue the rest in the background
>
> Interesting idea. You're not reducing the total client effort, but
> you're improving latency of getting the user to a checkout. Of course
> that doesn't help if they want to run "git log" as their first
> operation. ;)
>
>> But realized I was just starting to describe something like 'clone
>> --depth=1' followed by a 'fetch --unshallow' in the background, except
>> that would work better (if you did "just the tip" naïvely you'd get
>> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
>> write out two packs etc...).
>
> Right, that would work. I will note one thing, though: the total time to
> do a 1-depth clone followed by an unshallow is probably much higher than
> doing the whole clone as one unit, for two reasons:

Indeed. The hypothesis is that the user doesn't really care about the
clone time, but about the clone-to-mostly-usable-repo time.

>   1. The server won't use reachability bitmaps when serving the
>      follow-up fetch (because shallowness invalidates the reachability
>      data they're caching), so it will spend much more time in the
>      "Counting objects" phase.
>
>   2. The server has to throw away some deltas. Imagine version X of a
>      file in the tip commit is stored as a delta against version Y in
>      that commit's parent. The initial clone has to throw away the
>      on-disk delta of X and send you the whole object (because you are
>      not requesting Y at all). And then in the follow-up fetch, it must
>      either send you Y as a base object (wasting bandwidth), or it must
>      on-the-fly generate a delta from Y to X (wasting CPU).
>
>> But at this point I'm just starting to describe some shoddy version of
>> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
>> clone and fleshen right away" option.
>
> Yes. And partial-clone suffers from the problems above to an even
> greater extent. ;)
>
>> On protocol extensions: Just having a way to "wget" the corresponding
>> *.idx file from the server would be great, and reduce clone times by a
>> lot. There's the risk of trusting the server, but most people's use-case
>> is going to be pushing right back to the same server, which'll be doing
>> a full validation.
>
> One tricky thing is that the server may be handing you a bespoke .pack
> file. There is no matching ".idx" at all, neither in-memory nor on disk.
> And you would not want the whole on-disk .pack/.idx pair from a site
> like GitHub, where there are objects from many forks.
>
> So in general, I think you'd need some cooperation from the server side
> to ask it to generate and send the .idx that matches the .pack it is
> sending you. Or even if not the .idx format itself, some stable list of
> sha1s that you could use to reproduce it without hashing each
> uncompressed byte yourself.

Yeah, depending on how jt/fetch-cdn-offload is designed (see my
https://public-inbox.org/git/87k1hv6eel.fsf@xxxxxxxxxxxxxxxxxxx/) it
could be (ab)used to do this, i.e. you'd keep a "base" *.{pack,idx}
around for such a purpose.

So in such a case you'd serve up that recent-enough *.{pack,idx} for the
client to "wget", and the client would then trust it (or not) and do the
equivalent of a "fetch" from that point to be 100% up-to-date.
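
A rough local simulation of that flow, with `cp` standing in for the
hypothetical "wget" of the server's base *.{pack,idx}, and a made-up
`refs/base` ref recording the base tip the server would have to
advertise alongside it:

```shell
set -e
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$tmp/server"
echo one > "$tmp/server/file"
git -C "$tmp/server" add file
git -C "$tmp/server" commit -qm one
git -C "$tmp/server" gc -q            # base history now in one *.{pack,idx} pair
base=$(git -C "$tmp/server" rev-parse HEAD)
echo two > "$tmp/server/file"
git -C "$tmp/server" commit -qam two  # history continues past the base pack

git init -q "$tmp/client"
# "wget" the base pack and its idx (cp stands in for the download) ...
cp "$tmp/server/.git/objects/pack/"pack-* "$tmp/client/.git/objects/pack/"
# ... plus the base tip the server would advertise alongside it:
git -C "$tmp/client" update-ref refs/base "$base"
git -C "$tmp/client" remote add origin "file://$tmp/server"
git -C "$tmp/client" fetch -q origin  # incremental: only commit "two" travels
```

No "Resolving deltas" work is done for the base pack, because the
client takes the idx on faith; whether it then re-validates that idx in
the background or not is the risk-management question above.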

> This could even be stuffed into the pack format and stripped out by
> the receiving index-pack (i.e., each entry is prefixed with "and by
> the way, here is my sha1...").

That would be really interesting, i.e. just having room for that (or
anything else) in the pack format.

I wonder if it could be added to the delta-chain in the current format
as a nasty hack :)

>> We could also defer that validation instead of skipping it. E.g. wget
>> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
>> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
>> periodically to detect repository bitflips.
>>
>> Or, do some "narrow" validation of such an *.idx file right
>> away. E.g. for all the trees/blobs required for the current checkout,
>> and background the rest.
>
> The "do we have all of the objects we need" is already separate from
> "figure out the sha1 of each object", so I think you'd get that
> naturally if you just took in an untrusted .idx (it also demonstrates
> that any .idx cost is really focusing on blobs, because the "do we have
> all objects" check is going to decompress every commit and tree in the
> repo anyway).
>
> -Peff



