Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability

On Mon, Sep 30, 2024 at 9:57 AM Patrick Steinhardt <ps@xxxxxx> wrote:
>
> On Fri, Sep 27, 2024 at 03:48:11PM -0700, Junio C Hamano wrote:
> > Christian Couder <christian.couder@xxxxxxxxx> writes:
> >
> > > By the way there was an unconference breakout session on day 2 of the
> > > Git Merge called "Git LFS Can we do better?" where this was discussed
> > > with a number of people. Scott Chacon took some notes:
> > >
> > > https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
> >
> > Thanks for the link.
> >
> > > It was in parallel with the Contributor Summit, so few contributors
> > > participated in this session (maybe only Michael Haggerty, John Cai
> > > and me). But the impression of GitLab people there, including me, was
> > > that folks in general would be happy to have an alternative to Git LFS
> > > based on this.
> >
> > I am not sure what "based on this" is really about, though.
> >
> > This series adds a feature to redirect requests for one server to
> > another, but does it really do much to solve the problem LFS wants
> > to solve?  I would imagine that you would want to be able to manage
> > larger objects separately to avoid affecting the performance and
> > convenience of handling smaller objects, and to serve these larger
> > objects from a dedicated server.  You certainly can filter the
> > larger blobs away with a blob size filter, but when you really need
> > these larger blobs, it is unclear how the new capability helps, as
> > you cannot really tell what criteria the serving side that gave
> > you the "promisor-remote" capability wants you to use to sift your
> > requests between the original server and the new promisor.  Wouldn't
> > your requests _all_ be redirected to a single place, the promisor
> > remote you learned via the capability?
> >
> > Coming up with a better alternative to LFS is certainly good, and it
> > is a worthwhile addition to the system.  I just do not see how the
> > topic of this series helps further that goal.
>
> I guess it helps to address part of the problem. I'm not sure whether my
> understanding is aligned with Chris' intention, but I could certainly
> see that at some point in time we start to advertise promisor remote
> URLs that use different transport helpers to fetch objects. This would
> allow hosting providers to offload objects to e.g. blob storage or
> somesuch thing and the client would know how to fetch them.
>
> But there are still a couple of pieces missing in the bigger puzzle:
>
>   - How would a client know to omit certain objects? Right now it only
>     knows that there are promisor remotes, but it doesn't know that it
>     e.g. should omit every blob larger than X megabytes. The answer
>     could of course be that the client should just know to do a partial
>     clone by itself.

If we add a "filter" field to the "promisor-remote" capability in a
future patch series, then the server could pass information like a
filter-spec that the client could use to omit some large blobs.

Patch 3/4 has the following in its commit message about it: "In the
future, it might be possible to pass other information like a
filter-spec that the client should use when cloning from S".
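
To illustrate, here is a rough sketch of what that could look like
(the "filter" field and its exact syntax are hypothetical, nothing in
this series implements them yet):

    # What a client can already do today, picking a filter-spec by itself:
    git clone --filter=blob:limit=10m https://example.com/repo.git

    # What the server could additionally suggest via a hypothetical
    # "filter" field in the "promisor-remote" capability, for example:
    #
    #   promisor-remote=<name>,url=<url>,filter=blob:limit=10m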

>   - Storing those large objects locally is still expensive. We had
>     discussions in the past where such objects could be stored
>     uncompressed to stop wasting compute here.

Yeah, I think a new "verbatim" object representation in the object
database, as discussed in
https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/, is the most
likely and easiest option in the short term.

>     At GitLab, we're thinking
>     about the ability to use rolling hash functions to chunk such big
>     objects into smaller parts to also allow for somewhat efficient
>     deduplication. We're also thinking about how to make the overall ODB
>     pluggable such that we can eventually make it more scalable in this
>     context. But that's of course thinking into the future quite a bit.

Yeah, there are different options for this. For example HuggingFace
(https://huggingface.co/) recently acquired XetHub (see
https://huggingface.co/blog/xethub-joins-hf) and said they might open
source the XetHub software, which chunks large files and deduplicates
the chunks, so that could be an option too.

>   - Local repositories would likely want to prune large objects that
>     have not been accessed for a while to eventually regain some storage
>     space.

`git repack --filter` and such might already help a bit in this area.
I agree that more work is needed though.
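
For example, something like this (assuming a recent enough Git that
has `git repack --filter`, and a repository that already has a
promisor remote configured so that filtered-out objects can be
re-fetched lazily when needed):

    # Keep only blobs up to 10 MB in the local packs; bigger blobs are
    # dropped locally and fetched again from the promisor remote only
    # if they are ever needed.
    git repack -a -d --filter=blob:limit=10m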

> I think chipping away the problems one by one is fine. But it would be
> nice to draw something like a "big picture" of where we eventually want
> to end up at and how all the parts connect with each other to form a
> viable native replacement for Git LFS.

I have tried to discuss this at Git Merge 2022 and 2024, and perhaps
even before that. But as you know, it's difficult to get people to
agree on big projects that are not backed by patches and that might
span several years (especially when very few people actually work on
them, and those people might have other things to work on too).

Thanks,
Christian.




