Re: With big repos and slower connections, git clone can be hard to work with

Emily Shaffer <nasamuffin@xxxxxxxxxx> · Mon, 10 Jun 2024 12:04:30 -0700

On Sat, Jun 8, 2024 at 1:43 AM Jeff King <peff@xxxxxxxx> wrote:
>
> On Sat, Jun 08, 2024 at 02:46:38AM +0200, ellie wrote:
>
> > The deepening worked perfectly, thank you so much! I hope a resume will
> > still be considered however, if even just to help out newcomers.
>
> Because the packfile to send the user is created on the fly, making a
> clone fully resumable is tricky (a second clone may get an equivalent
> but slightly different pack due to new objects entering the repo, or
> even raciness between threads).
>
> One strategy people have worked on is for servers to point clients at
> static packfiles (which _do_ remain byte-for-byte identical, and can be
> resumed) to get some of the objects. But it requires some scheme on the
> server side to decide when and how to create those packfiles. So while
> there is support inside Git itself for this idea (both on the server and
> client side), I don't know of any servers where it is in active use.

We use packfile offloading heavily at Google (any repositories hosted
at *.googlesource.com, as well as our internal-facing hosting). It
works quite well for us scaling large projects like Android and
Chrome; we've been using it for some time now and are happy with it.

However, one thing that's missing is the resumable download Ellie is
describing. With a clone which has been turned into a packfile fetch
from a different data store, it *should* be resumable. But the client
currently lacks the ability to do that. (This just came up for us
internally the other day, and we ended up moving an internal bug to
https://git.g-issues.gerritcodereview.com/issues/345241684.) After a
resumed clone like this, you may not necessarily have the latest
state. For example, you may lose your connection with 90% of the clone
finished and not get it back for some days, by which point upstream has
moved, as Peff described elsewhere in this thread. But it would still
probably be cheaper to resume the remaining 10% of the packfile fetch
from the offloaded data store, then do an incremental fetch from the
server to get the couple of days of updates on top, than to start over
from zero with the server.
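
To make the "resume the remaining 10%" step concrete, here is a rough
sketch of what resuming a static packfile download could look like. It
is not what the Git client does today. It assumes the offloaded data
store serves the pack over HTTP(S) and honours Range requests, and the
function names (range_header, resume_download) are made up for
illustration.

```python
import os
import urllib.request


def range_header(partial_path):
    """Return a Range header value asking for the bytes we still lack."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return f"bytes={have}-"


def resume_download(url, partial_path, opener=urllib.request.urlopen,
                    chunk=1 << 16):
    """Append the missing tail of `url` to `partial_path`.

    Assumes the file behind `url` is byte-for-byte static, which is the
    property that makes an offloaded packfile resumable at all.
    Returns the size of the local file afterwards.
    """
    req = urllib.request.Request(
        url, headers={"Range": range_header(partial_path)})
    with opener(req) as resp:
        # 206 means the server honoured the range, so append. Any other
        # success status (typically 200) means it sent the whole file,
        # so overwrite instead of appending to the partial copy.
        mode = "ab" if resp.status == 206 else "wb"
        with open(partial_path, mode) as out:
            while True:
                block = resp.read(chunk)
                if not block:
                    break
                out.write(block)
    return os.path.getsize(partial_path)
```

Once the pack is complete, the client would still need to index it and
then run an ordinary incremental fetch against the server to pick up
whatever landed upstream in the meantime. A real implementation would
also want to verify the pack checksum before trusting the result.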

It seems to me that packfile URIs and bundle URIs are similar enough
that we could work out similar logic for both, no? Or maybe there's
something I'm missing about the way bundle offloading differs from
packfiles.

 - Emily

>
> -Peff
>