Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors

Christian Couder <christian.couder@xxxxxxxxx> · Tue, 18 Feb 2025 12:42:55 +0100

On Mon, Jan 27, 2025 at 7:02 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> >> > +In other words, the goal of this document is not to talk about all the
> >> > +possible ways to optimize how Git could handle large blobs, but to
> >> > +describe how a LOP based solution could work well and alleviate a
> >> > +number of current issues in the context of Git clients and servers
> >> > +sharing Git objects.
> >>
> >> But if you do not discuss even a single way, and handwave "we'll
> >> have this magical object storage that would solve all the problems
> >> for us", then we cannot really tell if the problem is solved by us,
> >> or handwaved away by assuming the magical object storage.
> >> We'd need at least one working example.
> >
> > It's not magical object storage. Amazon S3, GCP Bucket and MinIO
> > (which is open source), for example, already exist and are used a lot
> > in the industry.
>
> That's just "we can store a bunch of bytes and ask them to be
> retrieved".  What I said about handwaving the presence of magical
> "object storage" is exactly the "optimize how to handle large blobs"
> part.  I agree that we do not need to discuss _ALL_ the possible
> ways.  But without telling what our thoughts are on _how_ to use these
> "lower cost and safe by duplication but with high latency" services
> to store our objects efficiently enough to make it practical, I'd
> have to call what we see in the document "magical object storage".

I have added the following:

Even if LOPs are not used very efficiently, they can still be useful
and worth using in some cases because, as we will see in more detail
later in this document:

  - they can make it simpler for clients to use promisor remotes and
    therefore avoid fetching a lot of large blobs they might not need
    locally,

  - they can make it significantly cheaper or easier for servers to
    host a large part of the current repository content, and even
    more so to host content with larger blobs, or more large blobs,
    than they currently do.
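
To illustrate the first point, a client's configuration could look
roughly like the sketch below. The remote name "lop", both URLs and the
1m size limit are made up for the example; `promisor` and
`partialCloneFilter` are the existing per-remote settings used for
partial clone. This is only a sketch of one possible setup, not what
the document prescribes.

```ini
# Main remote: fetched with a blob size filter, so large blobs are
# not downloaded until they are actually needed.
[remote "origin"]
	url = https://git.example.com/project.git
	fetch = +refs/heads/*:refs/remotes/origin/*
	promisor = true
	partialCloneFilter = blob:limit=1m

# Hypothetical LOP: an additional promisor remote from which the
# missing large blobs can be fetched lazily.
[remote "lop"]
	url = https://lop.example.com/project.git
	promisor = true
	partialCloneFilter = blob:limit=1m
```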

I hope this addresses some of your concerns. I could also talk about
remote helpers and object storage here, but this would be duplicating
the "2) LOPs can use object storage" section. If you think that we
should share our thoughts about how to improve remote helpers and
object storage performance, I think they should go into that section
rather than here.

> >> > +7) A client can offload to a LOP
> >> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> > +
> >> > +When a client is using a LOP that is also a LOP of its main remote,
> >> > +the client should be able to offload some large blobs it has fetched,
> >> > +but might not need anymore, to the LOP.
> >>
> >> For a client that _creates_ a large object, the situation would be
> >> the same, right?  After it creates several versions of the opening
> >> segment of, say, a movie, the latest version may be still wanted,
> >> but the creating client may want to offload earlier versions.
> >
> > Yeah, but it's not clear if the versions of the opening segment should
> > be sent directly to the LOP without the main remote checking them in
> > some ways (hooks might be configured only on the main remote) and/or
> > checking that they are connected to the repo. I guess it depends on
> > the context if it would be OK or not.
>
> If it is not clear to us or whoever writes this document, the users
> would have a hard time making effective use of it, which is why I
> am worried about the current design of this feature.

Yeah, but this feature doesn't exist at all yet, and it might not even
be a priority, so I prefer not to promise too much.

For now, I have added:

"This should be discussed and refined when we get closer to
implementing this feature."

just after:

"It might depend on the context whether it is OK for clients
to offload large blobs they have created, instead of fetched, directly
to the LOP without the main remote checking them in some way
(possibly using hooks or other tools)."