On Mon, Apr 22 2019, Jeff King wrote:

> On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Your patch is optionally removing the "woah, we got an object with a
>> > duplicate sha1, let's check that the bytes are the same in both
>> > copies" check. But Martin's problem is a clone, so we wouldn't have
>> > any existing objects to duplicate in the first place.
>>
>> Right, but we do anyway, as reported by Geert at @amazon.com preceding
>> that patch of mine. But it is 99.99% irrelevant to *performance* in
>> this case after the loose object cache you added (though before that
>> it could make all the difference, depending on the FS).
>
> I scratched my head at this a bit. If we don't have any other objects,
> then what are we comparing against? But I think you mean that we have
> the overhead of doing the object lookups to find that out. Yes, that
> can add up if your filesystem has high latency, but I think in this
> case it is a drop in the bucket compared to dealing with the actual
> object data.

Right. There was no "we have no objects" special case, so this lookup
overhead is what dominated clone time before the loose object cache...

>> I just mentioned it to plant a flag on another bit of the code where
>> index-pack in general has certain paranoias/validation the user might
>> be willing to optionally drop just at "clone" time.
>
> Yeah, I agree it may be worth pursuing independently. I just don't
> think it will help Martin's case in any noticeable way.

...indeed; as noted I was just mentioning this in the context of things
in index-pack that *in general* might benefit from some "we had no
objects before" special case.

>> > Right, that would work. I will note one thing, though: the total
>> > time to do a 1-depth clone followed by an unshallow is probably much
>> > higher than doing the whole clone as one unit, for two reasons:
>>
>> Indeed. The hypothesis is that the user doesn't really care about the
>> clone time itself, but the time from starting the clone until the repo
>> is mostly usable.
>
> There was a little bit of self-interest in there for me, too, as a
> server operator. While it does add to the end-to-end time, most of the
> resource use for the shallow fetch gets put on the server. IOW, I don't
> think we'd be happy to see clients doing this
> depth-1-and-then-unshallow strategy for every clone.

Just change from per-seat pricing to charging a premium for CPU & I/O.
Now your problem is a solution :)

More seriously, yeah, I think we definitely need to be careful about
changes to git that'll eat someone's "free" server time to save the
client time/work. At the same time we have dedicated internal operators
who wouldn't mind spending that CPU, so hopefully we can in general find
some reasonable middle ground.

>> > So in general, I think you'd need some cooperation from the server
>> > side to ask it to generate and send the .idx that matches the .pack
>> > it is sending you. Or even if not the .idx format itself, some
>> > stable list of sha1s that you could use to reproduce it without
>> > hashing each uncompressed byte yourself.
>>
>> Yeah, depending on how jt/fetch-cdn-offload is designed (see my
>> https://public-inbox.org/git/87k1hv6eel.fsf@xxxxxxxxxxxxxxxxxxx/) it
>> could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
>> around for such a purpose.
>>
>> So in such a case you'd serve up that recent-enough *.{pack,idx} for
>> the client to "wget", and the client would then trust it (or not) and
>> do the equivalent of a "fetch" from that point to be 100% up-to-date.
>
> I think it's sort of orthogonal. Either way you have to teach the
> client how to get a .pack/.idx combo. Whether it learns to receive it
> inline from the first fetch, or whether it is taught to expect it from
> the out-of-band fetch, most of the challenge is the same.

I think it is, but maybe we're talking about different things. I suspect
a few of us have experimented with similar rsync-and-pull hacks as a
replacement for "clone", and it's much faster (often 50-90% faster):
just an rsync of a recent-enough .git dir (or .git/objects), followed by
a "reset --hard" to get the worktree, and then a "git pull" to get
up-to-date.
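For concreteness, that hack is something like the below. This is just a
sketch: the mirror host/path is made up, it assumes the mirror is a
recent non-bare copy of the same repo, and you have to trust whatever
you're rsync-ing from, since nothing here validates the copied objects:

    # Seed the object store (plus refs/config) from a recent-enough
    # snapshot on a fast internal mirror (hypothetical host/path):
    rsync -a mirror.example.com:/srv/mirrors/repo/.git/ repo/.git/
    cd repo

    # Materialize the worktree to match the copied HEAD:
    git reset --hard

    # Then a normal incremental fetch+merge brings us fully up-to-date;
    # only objects created since the snapshot go over the wire:
    git pull

The point being that the *.{pack,idx} pair arrives as precomputed files,
so the client never runs index-pack over the whole history, only over
the small incremental fetch at the end.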
>> > This could even be stuffed into the pack format and stripped out by
>> > the receiving index-pack (i.e., each entry is prefixed with "and by
>> > the way, here is my sha1...").
>>
>> That would be really interesting. I.e. just having room for that (or
>> anything else) in the pack format.
>>
>> I wonder if it could be added to the delta-chain in the current format
>> as a nasty hack :)
>
> There's definitely not "room" in any sense of the word in the pack
> format. :) However, as long as all parties agree, we can stick whatever
> we want into the on-the-wire format. So I was imagining something more
> like:
>
>   1. pack-objects learns a --report-object-id option that sticks some
>      additional bytes before each object (in its simplest form,
>      $obj_hash bytes of id)
>
>   2. likewise, index-pack learns a --parse-object-id option to receive
>      it and skip hashing the object bytes
>
>   3. we get a new protocol capability, "send-object-ids". If the
>      server advertises it and the client requests it, then both sides
>      turn on the appropriate option
>
> You could even imagine generalizing it to "--report-object-metadata",
> and including 0 or more metadata packets before each object, with
> object id being one, but possibly other computable bits like
> "generation number" after that. I'm not convinced other metadata is
> worth the space/time tradeoff, though. After all, this is stuff that
> the client _could_ generate and cache themselves, so you're trading
> off bandwidth to save the client from doing the computation.
>
> Anyway, food for thought. :)
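To make 1-3 concrete, I imagine the end state would let you exercise the
two halves locally with something like this (a sketch: neither option
exists today, the names are just taken from your outline, and "revs.txt"
is a made-up file of revision arguments, e.g. a single "HEAD" line):

    # Hypothetical server half: prefix each object in the stream with
    # its precomputed object id ("--report-object-id" does not exist):
    git pack-objects --revs --stdout --report-object-id \
        <revs.txt >annotated.pack

    # Hypothetical client half: trust the inline ids and skip hashing
    # each object's uncompressed bytes ("--parse-object-id" does not
    # exist either):
    git index-pack --stdin --parse-object-id <annotated.pack

And the "send-object-ids" capability would just be the negotiation that
turns both halves on, so a client that doesn't trust the server can
simply not request it and keep hashing everything as it does today.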