Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes:

> I'm looking at adding large file support to unpack-objects. A simple
> way is just to stream large blobs into loose objects. But I'd rather
> keep those blobs in a pack because pack-objects is happier that way.
> I'm looking at unpack-objects and thinking maybe it's best to just
> merge it back into index-pack.
>
> In normal mode (all small objects), index-pack receives the pack
> stream. Objects will be unpacked in phase two, resolving objects in
> index-pack. The only downside I can see is that the new unpack-objects
> now writes a temporary pack on disk, which does not sound too bad to
> me. unpack-objects is called on small packs, so the extra space is
> small. For single-huge-blob packs, it's good to keep them on disk
> anyway. When the pack has large blobs, we could just keep the full
> pack.
>
> After this, the only pack receiver on the client side is index-pack.
> fetch-pack does not have to choose between unpack-objects and
> index-pack; it just passes --unpack-limit <n> to index-pack.
>
> What do you think?

I think it is beneficial to step back a bit. What is the _real_ reason
why we call unpack-objects instead of index-pack when we receive only a
handful of objects? I think we did this to avoid littering the
receiving repository with too many packs from individual transfers.

As long as that goal is met, a solution that replaces the current "if
we are going to get fewer than N objects, explode them into loose
objects" rule does not actually have to explode them into loose
objects. It could explode normal objects into loose ones while
appending large ones to an existing pack (fixing up that pack's .idx
afterwards), for example. Or it could even choose to _always_ append
into an existing pack designated for receiving new objects. Or it could
punt on the "appending" part, declare that the large-object problem is
a rare event, and create/leave a new pack in the repository that stores
a large object (this, however, would not satisfy the "do not litter the
receiving repository with too many packs" goal if the "large object
problem" is not rare enough).

And the first step to make that happen would be to let a single
receiver program, instead of receive-pack/fetch-pack, make the
decision. That receiver program _might_ benefit from knowing how many
objects it is going to receive when making the decision before seeing a
single byte of the packstream, but there are other, more meaningful
data you can learn only after looking at what is in the pack.

So I like the general direction you are heading.

Probably the first step in the right structure of such a series would
be to introduce a new helper program that builtin/receive-pack.c::unpack()
and builtin/fetch-pack.c::get_pack() call, remove the header-peeking
these calling processes currently do, and make that new helper
responsible for switching between unpack-objects and index-pack (the
new helper may peek the header instead). The first implementation of
the new helper may decide exactly the way these two functions currently
choose between the two.

Once that is done, how objects in the incoming packstream are stored
locally becomes an implementation detail from the point of view of
fetch-pack and receive-pack, and nobody should notice when the new
helper is updated to call only the updated index-pack that knows how to
stream large (or all) objects into a pack.
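
To make that helper idea a bit more concrete, here is a rough sketch
(not a patch) of the decision step such a helper could perform,
assuming it starts out with today's "fewer than N objects means
explode" rule (the threshold is what receive.unpackLimit /
fetch.unpackLimit, or transfer.unpackLimit, control, 100 by default).
The names run_receiver() and unpack_limit are made up for illustration;
the --pack_header= pass-through mirrors what receive-pack already hands
to unpack-objects/index-pack so the child does not miss the header
bytes the helper consumed.

	/*
	 * Illustrative sketch only: peek the 12-byte pack header on
	 * stdin, then hand the rest of the stream to unpack-objects
	 * or index-pack depending on the object count.
	 */
	#include <inttypes.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <arpa/inet.h>

	static int read_full(int fd, void *buf, size_t len)
	{
		char *p = buf;
		while (len) {
			ssize_t n = read(fd, p, len);
			if (n <= 0)
				return -1;	/* EOF or error before a full header */
			p += n;
			len -= (size_t)n;
		}
		return 0;
	}

	static int run_receiver(uint32_t unpack_limit)
	{
		unsigned char hdr[12];
		uint32_t version, nr_objects;
		char hdr_arg[64];

		/* a pack stream starts with "PACK", version, object count */
		if (read_full(0, hdr, sizeof(hdr)) < 0 || memcmp(hdr, "PACK", 4))
			return -1;
		memcpy(&version, hdr + 4, 4);
		memcpy(&nr_objects, hdr + 8, 4);
		version = ntohl(version);
		nr_objects = ntohl(nr_objects);

		/* tell the child what we already read off the stream */
		snprintf(hdr_arg, sizeof(hdr_arg),
			 "--pack_header=%" PRIu32 ",%" PRIu32, version, nr_objects);

		if (nr_objects < unpack_limit)
			execlp("git", "git", "unpack-objects", hdr_arg, (char *)NULL);
		else
			execlp("git", "git", "index-pack", "--stdin", hdr_arg,
			       (char *)NULL);
		return -1;	/* exec failed */
	}

	int main(void)
	{
		/* 100 is the historical default unpack limit */
		return run_receiver(100) ? 1 : 0;
	}

The point of putting the decision in one place is that later teaching
it to look deeper than the header before deciding (e.g. noticing a
large blob and keeping the pack) stays invisible to fetch-pack and
receive-pack.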