Re: Add a "Flattened Cache" to `git --clone`?

Caleb Gray <hey@xxxxxxxxxxxxx> · Thu, 14 May 2020 14:33:06 -0700

To clarify: I'm talking about a server-side-only cache which behaves
much like a `tar` file: it is a flat version of exactly(*) what ends
up on the client's storage. When a client runs `git clone` and
there's a valid cache on the other end, that's all that gets streamed.

Konstantin's point that a repo like Linux is bound to see little or no
benefit (in fact, it would just constantly invalidate and rewrite the
~1 GB cache) is reasonable. This feature definitely targets the "niche"
audience of repos that see fewer pushes to master than clones.
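
To make the trade-off concrete, here is a rough sketch of the
invalidation rule I have in mind. It is not git code: `FlattenedCache`,
`serve_clone`, and `fake_pack` are hypothetical names, and the "stream"
is a placeholder for whatever bytes the server would actually send. The
only assumption is that the cache is valid for as long as master still
points at the commit it was built for.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FlattenedCache:
    """Hypothetical server-side cache holding one pre-built clone stream."""
    build_stream: Callable[[str], bytes]  # expensive: enumerate, compress, pack
    tip: Optional[str] = None             # commit id the cached stream was built for
    stream: bytes = b""
    hits: int = 0
    rebuilds: int = 0

    def serve_clone(self, master_tip: str) -> bytes:
        # The cache is valid only while master still points at the same commit.
        if self.tip == master_tip:
            self.hits += 1
            return self.stream
        # First clone, or first clone after a push: rebuild and remember the result.
        self.stream = self.build_stream(master_tip)
        self.tip = master_tip
        self.rebuilds += 1
        return self.stream


def fake_pack(tip: str) -> bytes:
    # Stand-in for the enumerate/count/compress work a real server would do.
    return f"pack-for-{tip}".encode()


# Repo with few pushes and many clones: one rebuild, then cache hits.
cache = FlattenedCache(build_stream=fake_pack)
for _ in range(5):
    cache.serve_clone("aaa111")
quiet = (cache.hits, cache.rebuilds)

# Repo where a push lands before every clone: every clone rebuilds.
busy_cache = FlattenedCache(build_stream=fake_pack)
for i in range(5):
    busy_cache.serve_clone(f"tip{i}")
busy = (busy_cache.hits, busy_cache.rebuilds)

print(quiet, busy)
```

The quiet repo pays for one rebuild and then serves four clones straight
from the cache; the busy repo rebuilds on every clone and never gets a
hit, which is the Linux case Konstantin describes.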

Bryan is exactly on the right track for what I'm referring to: the CDN
approach did come to mind (and is superior in nearly every way).

Junio nailed it: I'm not hoping for anything revolutionary here, just
hoping to reduce the redundant steps in clone down to a single
(presumably faster) step.

If the community agrees that a cache limited to master alone offers
little or no benefit, I'm also more than capable of designing a more
useful (and more complex) graph/reduce-based solution which could
dynamically bundle the most statistically relevant data for whatever
context the code is working in, though I can't commit to any sort of
deadline for that sort of contribution.

On Thu, May 14, 2020 at 2:05 PM Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
>
> On Thu, May 14, 2020 at 04:33:26PM -0400, Konstantin Ryabitsev wrote:
> > > Assuming my idea doesn't contradict other best practices or standards
> > > already in place,  I'd like to transform the typical `git clone` flow
> > > from:
> > >
> > >  Cloning into 'linux'...
> > >  remote: Enumerating objects: 4154, done.
> > >  remote: Counting objects: 100% (4154/4154), done.
> > >  remote: Compressing objects: 100% (2535/2535), done.
> > >  remote: Total 7344127 (delta 2564), reused 2167 (delta 1612),
> > > pack-reused 7339973
> > >  Receiving objects: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> > >  Resolving deltas: 100% (6180880/6180880), done.
> > >
> > > To subsequent clones (until cache invalidated) using the "flattened
> > > cache" version (presumably built while fulfilling the first clone
> > > request above):
> > >
> > >  Cloning into 'linux'...
> > >  Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> >
> > I don't think it's a common workflow for someone to repeatedly clone
> > linux.git. Automated processes like CI would be doing it, but they tend
> > to blow away the local disk between jobs, so they are unlikely to
> > benefit from any native git local cache for something like this (in
> > fact, we recommend that people use clone.bundle files for their CI
> > needs, as described here:
> > https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).
>
> If the goal is a git local cache, we have this today.  I'm not sure
> this is what Caleb was asking for, though:
>
> git clone --bare https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git base
> git clone --reference base https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext4
>
>                                                         - Ted