On Tue, Dec 10, 2024 at 08:43:03PM +0900, Junio C Hamano wrote:

> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > +In other words, the goal of this document is not to talk about all the
> > +possible ways to optimize how Git could handle large blobs, but to
> > +describe how a LOP based solution could work well and alleviate a
> > +number of current issues in the context of Git clients and servers
> > +sharing Git objects.
>
> But if you do not discuss even a single way, and handwave "we'll
> have this magical object storage that would solve all the problems
> for us", then we cannot really tell if the problem is solved by us,
> or by handwaved away by assuming the magical object storage. We'd
> need at least one working example.

It's something we're working on in parallel with the effort to slowly
move towards pluggable object databases. We aren't yet totally clear on
how exactly to store such objects, but there are a couple of ideas:

  - Store large objects verbatim in a separate path without any kind of
    compression at all. This solves the problem of wasting compute time
    during compression, but does not solve the problem of having to
    store blobs multiple times even if only a tiny part of them
    changes.

  - Use a rolling hash function to split up large objects into smaller
    hunks that can be deduplicated. This solves the issue of only small
    parts of the binary file changing, as we'd only have to store the
    hunks that have actually changed. This has been discussed e.g. in
    [1], and I've been talking with some people about rolling hash
    functions. (A toy sketch of how such chunking could look is
    appended at the end of this mail.)

In any case, getting to pluggable ODBs is likely a multi-year effort,
so I wonder how detailed we should be in the context of the document
here. We might want to mention that there are ideas and maybe even
provide some pointers, but I think it makes sense to defer the
technical discussion of how exactly this could look to the future,
mostly because I think it's going to be a rather big discussion on its
own.

Patrick

[1]: https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
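
To give a very rough idea of what the chunking approach could look
like, here is a toy sketch of content-defined chunking with a
polynomial rolling hash. The window size, the boundary mask (which
controls the expected chunk size, here roughly 8 KiB) and the hash
function itself are all made up for illustration; none of this is
meant as a concrete proposal for how the object store would actually
do it:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Toy content-defined chunking: maintain a polynomial rolling hash over
 * a sliding window and declare a chunk boundary whenever the low bits
 * of the hash are zero.
 */
#define WINDOW 64
#define BASE 257u
#define MASK ((1u << 13) - 1)

static void chunk_buffer(const unsigned char *buf, size_t len)
{
	uint32_t hash = 0, pow_w = 1;
	size_t chunk_start = 0;
	size_t i;

	/* BASE^WINDOW mod 2^32, used to drop the byte leaving the window. */
	for (i = 0; i < WINDOW; i++)
		pow_w *= BASE;

	for (i = 0; i < len; i++) {
		hash = hash * BASE + buf[i];
		if (i >= WINDOW)
			hash -= pow_w * buf[i - WINDOW];

		if ((hash & MASK) == 0 || i + 1 == len) {
			printf("chunk at %zu, length %zu\n",
			       chunk_start, i + 1 - chunk_start);
			chunk_start = i + 1;
		}
	}
}

int main(void)
{
	/* Toy driver: read the whole blob from stdin and chunk it. */
	size_t size = 0, cap = 1 << 20, n;
	unsigned char *buf = malloc(cap);

	if (!buf)
		return 1;
	while ((n = fread(buf + size, 1, cap - size, stdin)) > 0) {
		size += n;
		if (size == cap) {
			cap *= 2;
			buf = realloc(buf, cap);
			if (!buf)
				return 1;
		}
	}

	chunk_buffer(buf, size);
	free(buf);
	return 0;
}

The property we're after is that chunk boundaries depend only on a
small sliding window of content, so an insertion or deletion in one
part of a huge blob only invalidates the chunks around the edit
instead of shifting every chunk after it, and all unchanged chunks can
be deduplicated against the previous version of the blob.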