Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes:

> I'm looking at adding large file support to unpack-objects. A simple
> way is just to stream large blobs into loose objects. But I'd rather
> keep those blobs in a pack because pack-objects is happier that way.
> I'm looking at unpack-objects and thinking maybe it's best to just
> merge it back into index-pack.
>
> In normal mode (all small objects), index-pack receives the pack
> stream. Objects will be unpacked in phase two, resolving objects in
> index-pack. The only downside I can see is that the new unpack-objects
> now writes a temporary pack on disk, which does not sound too bad to
> me. unpack-objects is called on small packs, so the extra space is
> small. For single-huge-blob packs, it's good to keep them on disk
> anyway. When the pack has large blobs, we could just keep the full
> pack.
>
> After this, the only pack receiver on the client side is index-pack.
> fetch-pack does not have to choose between unpack-objects and
> index-pack; it just passes --unpack-limit <n> to index-pack.
>
> What do you think?

I think it is beneficial to step back a bit. What is the _real_ reason
why we call unpack-objects instead of index-pack when we receive only a
handful of objects? I think we did this to avoid littering the
receiving repository with too many packs from individual transfers.

As long as that goal is met, a solution that replaces the current "if
we are going to get fewer than N objects, explode them into loose
objects" rule does not actually have to explode them into loose
objects. It could explode normal objects into loose ones while
appending large ones to an existing pack (fixing up that pack's .idx
afterwards), for example. Or it could even choose to _always_ append
into an existing pack designated for receiving new objects. Or it could
punt on the "appending" part, declare that the large-object problem is
a rare event, and create/leave a new pack in the repository that stores
a large object (this, however, would not satisfy the "do not litter the
receiving repository with too many packs" goal if the "large object
problem" is not rare enough).

And the first step to make that happen would be to let a single
receiver program, instead of receive-pack/fetch-pack, make the
decision. That receiver program _might_ benefit from knowing how many
objects it is going to receive when making the decision before seeing a
single byte of the packstream, but there are other, more meaningful
data you can learn only after looking at what is in the pack.

So I like the general direction you are heading.

Probably the first step in the right structure of such a series would
be to introduce a new helper program that builtin/receive-pack.c::unpack()
and builtin/fetch-pack.c::get_pack() call, remove the header-peeking
these calling processes currently do, and make that new helper
responsible for switching between unpack-objects and index-pack (the
new helper may peek the header instead). The first implementation of
the new helper may decide exactly the way these two functions currently
choose between the two.

Once that is done, how objects in the incoming packstream are stored
locally becomes an implementation detail from the point of view of
fetch-pack and receive-pack, and nobody should notice when the new
helper is updated to call only the updated index-pack that knows how to
stream large (or all) objects into a pack.
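
To make that helper idea a bit more concrete, here is a rough sketch
(not a patch) of the decision step such a helper could perform,
assuming it starts out with today's "fewer than N objects means
explode" rule (the threshold is what receive.unpackLimit /
fetch.unpackLimit, or transfer.unpackLimit, control, 100 by default).
The names run_receiver() and unpack_limit are made up for illustration;
the --pack_header= pass-through mirrors what receive-pack already hands
to unpack-objects/index-pack so the child does not miss the header
bytes the helper consumed.

	/*
	 * Illustrative sketch only: peek the 12-byte pack header on
	 * stdin, then hand the rest of the stream to unpack-objects
	 * or index-pack depending on the object count.
	 */
	#include <inttypes.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <arpa/inet.h>

	static int read_full(int fd, void *buf, size_t len)
	{
		char *p = buf;
		while (len) {
			ssize_t n = read(fd, p, len);
			if (n <= 0)
				return -1;	/* EOF or error before a full header */
			p += n;
			len -= (size_t)n;
		}
		return 0;
	}

	static int run_receiver(uint32_t unpack_limit)
	{
		unsigned char hdr[12];
		uint32_t version, nr_objects;
		char hdr_arg[64];

		/* a pack stream starts with "PACK", version, object count */
		if (read_full(0, hdr, sizeof(hdr)) < 0 || memcmp(hdr, "PACK", 4))
			return -1;
		memcpy(&version, hdr + 4, 4);
		memcpy(&nr_objects, hdr + 8, 4);
		version = ntohl(version);
		nr_objects = ntohl(nr_objects);

		/* tell the child what we already read off the stream */
		snprintf(hdr_arg, sizeof(hdr_arg),
			 "--pack_header=%" PRIu32 ",%" PRIu32, version, nr_objects);

		if (nr_objects < unpack_limit)
			execlp("git", "git", "unpack-objects", hdr_arg, (char *)NULL);
		else
			execlp("git", "git", "index-pack", "--stdin", hdr_arg,
			       (char *)NULL);
		return -1;	/* exec failed */
	}

	int main(void)
	{
		/* 100 is the historical default unpack limit */
		return run_receiver(100) ? 1 : 0;
	}

The point of putting the decision in one place is that later teaching
it to look deeper than the header before deciding (e.g. noticing a
large blob and keeping the pack) stays invisible to fetch-pack and
receive-pack.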