On Tue, Dec 10, 2024 at 08:43:03PM +0900, Junio C Hamano wrote:

> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > +In other words, the goal of this document is not to talk about all the
> > +possible ways to optimize how Git could handle large blobs, but to
> > +describe how a LOP based solution could work well and alleviate a
> > +number of current issues in the context of Git clients and servers
> > +sharing Git objects.
>
> But if you do not discuss even a single way, and handwave "we'll
> have this magical object storage that would solve all the problems
> for us", then we cannot really tell if the problem is solved by us,
> or by handwaved away by assuming the magical object storage. We'd
> need at least one working example.

It's something we're working on in parallel with the effort to slowly
move towards pluggable object databases. We aren't yet totally clear on
how exactly to store such objects, but there are a couple of ideas:

  - Store large objects verbatim in a separate path without any kind of
    compression at all. This solves the problem of wasting compute time
    during compression, but does not solve the problem of having to
    store blobs multiple times even if only a tiny part of them
    changes.

  - Use a rolling hash function to split up large objects into smaller
    hunks that can be deduplicated. This solves the issue of only small
    parts of the binary file changing, as we'd only have to store the
    hunks that have actually changed. This has been discussed e.g. in
    [1], and I've been talking with some people about rolling hash
    functions. (A toy sketch of how such chunking could look is
    appended at the end of this mail.)

In any case, getting to pluggable ODBs is likely a multi-year effort,
so I wonder how detailed we should be in the context of the document
here. We might want to mention that there are ideas and maybe even
provide some pointers, but I think it makes sense to defer the
technical discussion of how exactly this could look to the future,
mostly because I think it's going to be a rather big discussion on its
own.

Patrick

[1]: https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
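
To give a very rough idea of what the chunking approach could look
like, here is a toy sketch of content-defined chunking with a
polynomial rolling hash. The window size, the boundary mask (which
controls the expected chunk size, here roughly 8 KiB) and the hash
function itself are all made up for illustration; none of this is
meant as a concrete proposal for how the object store would actually
do it:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Toy content-defined chunking: maintain a polynomial rolling hash over
 * a sliding window and declare a chunk boundary whenever the low bits
 * of the hash are zero.
 */
#define WINDOW 64
#define BASE 257u
#define MASK ((1u << 13) - 1)

static void chunk_buffer(const unsigned char *buf, size_t len)
{
	uint32_t hash = 0, pow_w = 1;
	size_t chunk_start = 0;
	size_t i;

	/* BASE^WINDOW mod 2^32, used to drop the byte leaving the window. */
	for (i = 0; i < WINDOW; i++)
		pow_w *= BASE;

	for (i = 0; i < len; i++) {
		hash = hash * BASE + buf[i];
		if (i >= WINDOW)
			hash -= pow_w * buf[i - WINDOW];

		if ((hash & MASK) == 0 || i + 1 == len) {
			printf("chunk at %zu, length %zu\n",
			       chunk_start, i + 1 - chunk_start);
			chunk_start = i + 1;
		}
	}
}

int main(void)
{
	/* Toy driver: read the whole blob from stdin and chunk it. */
	size_t size = 0, cap = 1 << 20, n;
	unsigned char *buf = malloc(cap);

	if (!buf)
		return 1;
	while ((n = fread(buf + size, 1, cap - size, stdin)) > 0) {
		size += n;
		if (size == cap) {
			cap *= 2;
			buf = realloc(buf, cap);
			if (!buf)
				return 1;
		}
	}

	chunk_buffer(buf, size);
	free(buf);
	return 0;
}

The property we're after is that chunk boundaries depend only on a
small sliding window of content, so an insertion or deletion in one
part of a huge blob only invalidates the chunks around the edit
instead of shifting every chunk after it, and all unchanged chunks can
be deduplicated against the previous version of the blob.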