On Fri, Sep 27, 2024 at 03:48:11PM -0700, Junio C Hamano wrote:
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > By the way there was an unconference breakout session on day 2 of the
> > Git Merge called "Git LFS Can we do better?" where this was discussed
> > with a number of people. Scott Chacon took some notes:
> >
> > https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
>
> Thanks for a link.
>
> > It was in parallel with the Contributor Summit, so few contributors
> > participated in this session (maybe only Michael Haggerty, John Cai
> > and me). But the impression of GitLab people there, including me, was
> > that folks in general would be happy to have an alternative to Git LFS
> > based on this.
>
> I am not sure what "based on this" is really about, though.
>
> This series adds a feature to redirect requests from one server to
> another, but does it really do much to solve the problem LFS wants
> to solve? I would imagine that you would want to be able to manage
> larger objects separately to avoid affecting the performance and
> convenience of handling smaller objects, and to serve these larger
> objects from a dedicated server. You certainly can filter the
> larger blobs away with the blob size filter, but when you really need
> these larger blobs, it is unclear how the new capability helps, as
> you cannot really tell what criteria the serving side that gave
> you the "promisor-remote" capability wants you to use to sift your
> requests between the original server and the new promisor. Wouldn't
> your requests _all_ be redirected to a single place, the promisor
> remote you learned about via the capability?
>
> Coming up with a better alternative to LFS is certainly good, and it
> is a worthwhile addition to the system. I just do not see how the
> topic of this series helps further that goal.

I guess it helps to address part of the problem. I'm not sure whether
my understanding is aligned with Chris' intention, but I could
certainly see that at some point we start to advertise promisor remote
URLs that use different transport helpers to fetch objects. This would
allow hosting providers to offload objects to e.g. blob storage or
some such thing, and the client would know how to fetch them (the
second sketch at the end of this mail shows how such a setup can be
wired up manually today).

But there are still a couple of pieces missing in the bigger puzzle:

  - How would a client know to omit certain objects? Right now it only
    knows that there are promisor remotes, but it doesn't know that it
    should, for example, omit every blob larger than X megabytes. The
    answer could of course be that the client should just know to do a
    partial clone by themselves (see the first sketch below).

  - Storing those large objects locally is still expensive. We had
    discussions in the past where such objects could be stored
    uncompressed to stop wasting compute here. At GitLab, we're
    thinking about the ability to use rolling hash functions to chunk
    such big objects into smaller parts, which would also allow for
    somewhat efficient deduplication. We're also thinking about how to
    make the overall ODB pluggable such that we can eventually make it
    more scalable in this context. But that's of course thinking quite
    a bit into the future.

  - Local repositories would likely want to prune large objects that
    have not been accessed for a while to eventually regain some
    storage space (see the third sketch below).

I think chipping away at the problems one by one is fine. But it would
be nice to draw something like a "big picture" of where we eventually
want to end up and how all the parts connect with each other to form a
viable native replacement for Git LFS.
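To make the first point above a bit more concrete, here is roughly
what a client-initiated partial clone looks like today. The URL is of
course made up, and the server needs to permit object filtering:

    # Clone while omitting every blob larger than 10MB. The server
    # has to allow this, typically via `uploadpack.allowFilter`.
    $ git clone --filter=blob:limit=10m https://example.com/repo.git
    $ cd repo

    # The origin should now be recorded as a promisor remote, so the
    # omitted blobs are fetched lazily, e.g. when a checkout needs them.
    $ git config remote.origin.promisor
    true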
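And this second sketch shows how one can already wire up a dedicated
large-object server by hand, which is more or less the manual version
of what an advertised "promisor-remote" capability would automate.
Again, the URL and remote name are made up:

    # Add a dedicated server for large objects and mark it as a
    # promisor remote so that lazy fetches may be served from it, too.
    $ git remote add blobs https://blobs.example.com/repo.git
    $ git config remote.blobs.promisor true

    # Optionally record which filter that remote is expected to serve.
    $ git config remote.blobs.partialclonefilter blob:limit=10m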
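As for the third point, there is no access-time-based expiry yet as
far as I know, but if I remember correctly git-repack(1) recently grew
a `--filter=` option that can be (ab)used to evict large blobs from
the local object database, as long as they remain fetchable from a
promisor remote:

    # Rewrite the local packs without blobs larger than 10MB. The
    # filtered-out objects are dropped locally, so this is only safe
    # when they can be re-fetched from a promisor remote.
    $ git repack -a -d --filter=blob:limit=10m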
Also Cc'ing brian, who likely has a thing or two to say about this :)

Patrick