Re: Add a "Flattened Cache" to `git clone`?

On Thu, May 14, 2020 at 1:33 PM Konstantin Ryabitsev
<konstantin@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, May 14, 2020 at 07:34:08AM -0700, Caleb Gray wrote:
> > I've done some searching around the Internet, mailing lists, and
> > reached out in IRC a couple of days ago... and haven't found anyone
> > else asking about a long-brewed contribution idea that I'd finally
> > like to implement. First I wanted to run it by you guys, though, since
> > this is my first time reaching out.
> >
> > Assuming my idea doesn't contradict other best practices or standards
> > already in place, I'd like to transform the typical `git clone` flow
> > from:
> >
> >  Cloning into 'linux'...
> >  remote: Enumerating objects: 4154, done.
> >  remote: Counting objects: 100% (4154/4154), done.
> >  remote: Compressing objects: 100% (2535/2535), done.
> >  remote: Total 7344127 (delta 2564), reused 2167 (delta 1612),
> > pack-reused 7339973
> >  Receiving objects: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> >  Resolving deltas: 100% (6180880/6180880), done.
> >
> > To subsequent clones (until cache invalidated) using the "flattened
> > cache" version (presumably built while fulfilling the first clone
> > request above):
> >
> >  Cloning into 'linux'...
> >  Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
>
> I don't think it's a common workflow for someone to repeatedly clone
> linux.git. Automated processes like CI would be doing it, but they tend
> to blow away the local disk between jobs, so they are unlikely to
> benefit from any native git local cache for something like this (in
> fact, we recommend that people use clone.bundle files for their CI
> needs, as described here:
> https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).
>
> I believe there's quite a bit of work being done by Gitlab folks to make
> it possible to offload more object fetching to lookaside-caches like
> CDN. Perhaps one of them can provide an update on how that is going.

I can't speak for GitLab, but Bitbucket Server (formerly Stash) has
done this for years, and I believe GitHub does as well. For Bitbucket
Server, our caching doesn't change what the client sees (i.e. they
still see "Counting objects" and "Compressing objects"), but the early
steps essentially jump straight to 100% (since that progress
information is included in our cached data), and then the client
starts receiving the pack.

I'm not sure how straightforward--or desirable--it would be for
something like this to be done natively by Git itself. It would
certainly make building hosting solutions simpler, which could be a
win for smaller setups that don't use something like Bitbucket Server,
GitLab, or GitHub, but I'm not sure that's a big win. Effort on
something like clonebundles (in Mercurial parlance) or similar seems
likely to offer a lot more bang for the buck than caching packs for
specific wants/haves.
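For readers unfamiliar with the clone.bundle approach mentioned above,
the idea is roughly: serve a periodically regenerated bundle of the
whole repository as a static file (which a CDN can cache), clone from
that, then fetch only the small remaining delta from the live server.
A minimal sketch using stock git commands (the repository names, the
bundle path, and the kernel.org URL here are illustrative, not taken
from the thread):

```shell
# Server/publisher side: regenerate a bundle of all refs periodically.
# (A throwaway "demo" repo stands in for the real repository here.)
git init --quiet demo && cd demo
git -c user.email=ci@example.com -c user.name=CI \
    commit --allow-empty -m "initial" --quiet
git bundle create ../clone.bundle HEAD --all
cd ..

# CI side: clone from the static, CDN-cacheable bundle instead of
# hitting the live server for a full pack.
git clone --quiet clone.bundle linux-from-bundle

# Then point origin back at the real repository so a later
# `git fetch` only transfers commits newer than the bundle.
git -C linux-from-bundle remote set-url origin \
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git -C linux-from-bundle log --oneline -1
```

Because the bundle is a static file, mirrors and CDNs can serve it
without the server re-counting and re-compressing objects for every
clone, which is exactly the cost the cached-pack approaches discussed
above are trying to avoid.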

Just my 2 cents as someone who has directly worked on this sort of caching.

Bryan Turner


