Re: [PATCHv2 2/2] receive-pack: Add receive.objectCountLimit to refuse push with too many objects

Shawn Pearce <spearce@xxxxxxxxxxx> · Sat, 14 May 2011 15:27:36 -0700

On Sat, May 14, 2011 at 10:50, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Shawn Pearce <spearce@xxxxxxxxxxx> writes:
>
>> If we are going to put limits in, does it make sense to try and push
>> these limits back to pack-objects in a more detailed way? You talked
>> about depth of history, or size of pack. pack-objects could
>> approximate both.
>
> I think the receiving end switches between index-pack and unpack-objects
> based on the object count,

It does...

> but if we can add a protocol extension to carry
> these estimates from the sender, we would be much better off. A large push
> with a small number of objects would want to keep everything in a pack,
> for example, but currently there is no way to do so.

Yes, my thoughts exactly. A 600 MiB push of 3 objects (1 huge blob,
top level tree, commit) shouldn't be unpacked. Serving that huge blob
back out to clients that want to fetch it is more efficient when its
coming from a pack than from a loose object, due to the object reuse
that is available from a pack. Unfortunately we don't have the data
necessary to make that decision.

Probably the cleanest way to do this estimate transmission is to
modify the pack stream format. Introduce a new capability for
pack-stats. If advertised from receive-pack, send-pack can pass a
--pack-stats flag to pack-objects. That process then writes out a PACK
header with version code 4 (rather than 2), and includes additional
data beyond the object count in an extended PACK header. receive-pack
can then read the PACK header and handle version 2 as it does today,
and version 4 by also checking the additional stat fields and dumping
what is left over into either unpack-objects or index-pack like it
does now. We wouldn't even need to change index-pack or
unpack-objects, receive-pack could translate the PACK version 4 header
into a version 2 command line flag when it invokes the sub process to
handle the stream.

Hmm, or maybe this should be pack v5, just so nobody confuses it with
the mythical pack v4 we haven't implemented.

If we do add this stats header to the PACK stream, we need to be a bit
smarter about it. I would lay out the header with an additional 4 byte
field telling us the size of the stats block, and then use repeating
key-value pairs within the stats block to transmit the data. This way
clients can add additional stats to the stream and receive-pack can
ignore them if it doesn't understand them. We can start out with
number of commits and estimated size of the packed data (based on
summing up the reused object sizes) but could add more later like date
of oldest commit if we later decide it would help the remote handle
the pack stream, or reject it early.

As far as receive-pack rejecting the stream after seeing the header, I
would like to avoid an extra round-trip between the two peers. So we
probably need to modify send-pack like I suggest above to check the
stream for an error message after pack-objects terminates, even if
pack-objects terminates with a non-zero exit status. Even over a POSIX
pipe we should have room for 512 bytes of error data to be crammed
into the stream from receive-pack before we deadlock the two peers. If
we wanted to be really paranoid, we would run pack-objects in the
background and have send-pack constantly monitor the incoming half of
the pipe for a status result. Since pack-objects is already another
process this shouldn't be that difficult. (But it is harder for me in
JGit where its not a separate thread.)

-- 
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html