Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors

Christian Couder <christian.couder@xxxxxxxxx> writes:

> +We will call a "Large Object Promisor", or "LOP" in short, a promisor
> +remote which is used to store only large blobs and which is separate
> +from the main remote that should store the other Git objects and the
> +rest of the repos.
> +
> +By extension, we will also call "Large Object Promisor", or LOP, the
> +effort described in this document to add a set of features to make it
> +easier to handle large blobs/files in Git by using LOPs.
> +
> +This effort would especially improve things on the server side, and
> +especially for large blobs that are already compressed in a binary
> +format.

The implementation on the server side can be hidden and be improved
as long as we have a reasonable wire protocol.  As it stands, even
with the promisor-remote referral extension, the data coming from a
LOP is still expected to be a pack stream, which I am not sure is a
good match.

Yes, I know the document later says it won't go into the storage
layer, but in order to get the details of the protocol extension
right, we MUST have some idea of the characteristics of the storage
layer, so that the protocol works well with a storage implementation
that has those characteristics.  Is the expectation that we give up
on deltifying these LOP objects (which might be a sensible
assumption, if they are incompressible large binary gunk) and store
each object in the LOP as a base representation inside a pack stream
(i.e. the in-pack "undeltified representation" defined in
Documentation/gitformat-pack.txt)?  Then sending these LOP objects
would just be a matter of preparing the pack header (PACK + version
+ numobjects) and concatenating these objects, while computing the
running checksum to place in the trailer of the pack stream.  Could
even that still be too expensive for the server side, having to
compute the running checksum, and might we want to update the object
transfer part of the pack stream definition somehow to reduce the
load on the server side?
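
To make that cost concrete, here is a minimal sketch (in no way a
proposal for the actual implementation) of such a server-side loop,
assuming each object is stored on the LOP as a ready-to-concatenate
undeltified pack entry, i.e. its in-pack object header followed by
the deflated data; lop_entry() is a made-up accessor for such a
stored entry:

    #include <arpa/inet.h>   /* htonl */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <openssl/sha.h>

    /*
     * Hypothetical accessor: returns the idx-th object's stored,
     * ready-to-send undeltified pack entry (in-pack object header
     * plus deflated data) and its length.
     */
    extern const void *lop_entry(size_t idx, size_t *len);

    static void send_pack(FILE *out, uint32_t nr_objects)
    {
        SHA_CTX ctx;
        unsigned char trailer[SHA_DIGEST_LENGTH];
        uint32_t hdr[3] = {
            htonl(0x5041434b),   /* "PACK" */
            htonl(2),            /* pack format version */
            htonl(nr_objects),
        };

        /* The trailer checksum covers everything before it. */
        SHA1_Init(&ctx);
        SHA1_Update(&ctx, hdr, sizeof(hdr));
        fwrite(hdr, sizeof(hdr), 1, out);

        for (uint32_t i = 0; i < nr_objects; i++) {
            size_t len;
            const void *entry = lop_entry(i, &len);

            /*
             * Per-object cost is one checksum update plus the
             * copy itself; no delta search, no recompression.
             */
            SHA1_Update(&ctx, entry, len);
            fwrite(entry, 1, len, out);
        }

        SHA1_Final(trailer, &ctx);
        fwrite(trailer, sizeof(trailer), 1, out);
    }

If even that running checksum over many gigabytes is too much for
the server, that is the part of the pack stream definition that
would have to give.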

> +- We will not discuss those client side improvements here, as they
> +  would require changes in different parts of Git than this effort.
> ++
> +So we don't pretend to fully replace Git LFS with only this effort,
> +but we nevertheless believe that it can significantly improve the
> +current situation on the server side, and that other separate
> +efforts could also improve the situation on the client side.

Even if our goal were only to improve the server side, we would
still need to come up with a minimally working set of client-side
components in order to demonstrate the benefit of the effort.

> +In other words, the goal of this document is not to talk about all the
> +possible ways to optimize how Git could handle large blobs, but to
> +describe how a LOP based solution could work well and alleviate a
> +number of current issues in the context of Git clients and servers
> +sharing Git objects.

But if you do not discuss even a single way, and handwave "we'll
have this magical object storage that would solve all the problems
for us", then we cannot really tell whether the problem is solved by
us, or handwaved away by assuming the magical object storage.  We'd
need at least one working example.

> +6) A protocol negotiation should happen when a client clones
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a client clones from a main repo, there should be a protocol
> +negotiation so that the server can advertise one or more LOPs and so
> +that the client and the server can discuss if the client could
> +directly use a LOP the server is advertising. If the client and the
> +server can agree on that, then the client would be able to get the
> +large blobs directly from the LOP and the server would not need to
> +fetch those blobs from the LOP to be able to serve the client.
> +
> +Note
> +++++
> +
> +For fetches instead of clones, see the "What about fetches?" FAQ entry
> +below.
> +
> +Rationale
> ++++++++++
> +
> +Security, configurability and efficiency of setting things up.

It is unclear how it improves security and configurability if we
limit the protocol exchange to clone time only (implying that
neither side can change the arrangement later).  And it will lead to
security issues if we assume that it is impossible for one side to
"lie" to the other about what they earlier agreed on (unless we
somehow make it actually impossible to lie to the other side, of
course).
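
For concreteness, a clone-time exchange along these lines (the
exact syntax being precisely what this series needs to pin down;
the capability name and fields below are made up for illustration)
might look like:

    S: promisor-remote=name=lop1,url=https://lop.example.com/foo
    C: promisor-remote=lop1
    ... client fetches from the main remote without the large
        blobs, then gets those blobs directly from lop1 ...

Whatever the syntax ends up being, it would need to be possible to
redo the same exchange on a later fetch, and neither side should
have to trust the other's claim about what was agreed on at clone
time.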

> +7) A client can offload to a LOP
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a client is using a LOP that is also a LOP of its main remote,
> +the client should be able to offload some large blobs it has fetched,
> +but might not need anymore, to the LOP.

For a client that _creates_ a large object, the situation would be
the same, right?  After it creates several versions of the opening
segment of, say, a movie, the latest version may still be wanted,
but the creating client may want to offload the earlier versions.
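
As an aside, for blobs the client merely fetched, which the LOP by
definition already has, the local half of such offloading can
probably already be approximated with the repack filtering that
exists today, e.g.:

    # Assuming the LOP is configured as a promisor remote, drop
    # local copies of blobs of 10MB or larger; they can be fetched
    # again from the promisor on demand.
    git repack -a -d --filter=blob:limit=10m

It is the reverse direction, getting objects the client itself
created over to the LOP before dropping them locally, that needs
new machinery.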




