Re: [RFC PATCH 00/13] Add bundle-uri: resumably clones, static "dumb" CDN etc.

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sat, 07 Aug 2021 04:19:59 +0200

On Fri, Aug 06 2021, Jonathan Nieder wrote:

> Ævar Arnfjörð Bjarmason wrote:
>
>> Or perhaps not, but they're my currently my best effort to explain the
>> differences between the two and how they interact. So I think it's best
>> to point to those instead of coming up with something in this reply,
>> which'll inevitably be an incomplete rewrite of much of that.
>>
>> In short, there are use-cases that packfile-uri is inherently unsuitable
>> for, or rather changing the packfile-uri feature to support them would
>> pretty much make it indistinguishable from this bundle-uri mechanism,
>> which I think would just add more confusion to the protocol.
>
> Hm.  I was hoping you might say more about those use cases --- e.g. is
> there a concrete installation that wants to take advantage of this?
> By focusing on the real-world example, we'd get a better shared
> understanding of the underlying constraints.

I hacked this up for potential use on GitLab's infrastructure, mainly as
a mechanism to relieve CPU pressure on CI thundering herds.

Often you need full clones, and you sometimes need to do those from
scratch. When you've just had a push come in it's handy to convert those
to incremental fetches on top of a bundle you made recently.

It's not deployed on anything currently, it's just something I've been
hacking up. I'll be on vacation much of the rest of this month, the plan
is to start stressing it on real-world use-cases after that. I thought
I'd send this RFC first.

> After all, both are ways to reduce the bandwidth of a clone or other
> large fetch operation by offloading the bulk of content to static
> serving.

The support we have for packfile-uri in git.git now as far as the server
side goes, I think it's fair to say, fairly immature, I gather that
JGit's version is more advanced, and that's what's serving up things at
Google at e.g. https://chromium.googlesource.com.

I.e. right now for git-upload-pack you need to exhaustively enumerate
all the objects to exclude, although there's some on-list patches
recently for being able to supply tips.

More importantly your CDN reliability MUST match that of your git
server, otherwise your clone fails (as the server has already sent the
"other side" of the expected CDN pack).

Furthermore you MUST as the server be able to tell the client what pack
checksum on the CDN they should expect, which requires a very tight
coupling between git server and CDN.

You can't e.g. say "bootstrap with this bundle/pack" and point to
something like Debian's async-updated FTP network as a source. The
bootstrap data may or may not be there, and it may or may not be as
up-to-date as you'd like.

I think any proposed integration into git.git should mainly consider the
bulk of users, the big hosting providers can always run with their own
patches.

I think this approach opens up the potential for easier and therefore
wider CDN integration for git servers for providers that aren't one of
the big fish. E.g. your CDN generation can be daily cronjob, and the
server can point to it blindly and hope for the best. The client will
optimistically use the CDN data, and recover if not.

I think one thing I oversold is the "path to resumable clones",
i.e. that's all true, but I don't think that's really any harder go do
with packfile-uri in theory (in practice just serving up a sensible pack
with it is pretty tedious with git-upload-pack as it stands).

The negotiation aspect of it is also interesting and something I've been
experimenting with. I.e. the bundles are what the client sends as its
HAVE tips. This allows a server to anticipate what dialog newly cloning
clients are likely to have, and even pre-populate a cache of that
locally (it could even serve that diff up as a packfile-uri :).

Right now it retrieves each bundle in full before adding the tips to
negotiate to a subsequent dialog, but I've successfully experimental
locally with negotiating on the basis of objects we don't even have
yet. I.e. download the bundle(s), and as soon as we have the header fire
off the dialog with the server to get its PACK on the basis of those
promised tips.

It makes the recovery a bit more involved in case the bundles don't have
what we're after, but this allows us to disconnect really fast from the
server and twiddle our own thumbs while we finish downloading the
bundles to get the full clone. We already disconnect early in cases
where the bundle(s) already have what we're after.

This E-Mail got rather long, but hey, I did point you parts of this
series that cover some/most of this, and since it wasn't clear what if
anything of that you'd read ... :) Hopefully this summary is useful.