Re: Performance issue: initial git clone causes massive repack

Jon Smirl <jonsmirl@xxxxxxxxx> · Mon, 6 Apr 2009 11:28:39 -0400

On Mon, Apr 6, 2009 at 11:14 AM, Nicolas Pitre <nico@xxxxxxx> wrote:
> On Mon, 6 Apr 2009, Jon Smirl wrote:
>
>> On Mon, Apr 6, 2009 at 10:19 AM, Nicolas Pitre <nico@xxxxxxx> wrote:
>> > On Mon, 6 Apr 2009, Jon Smirl wrote:
>> >
>> >> First thing an initial clone does is copy all of the pack files from
>> >> the server to the client without even looking at them.
>> >
>> > This is a no go for reasons already stated many times.  There are
>> > security implications (those packs might contain stuff that you didn't
>> > intend to be publicly accessible) and there might be efficiency
>> > reasons as well (you might have a shared object store with lots of stuff
>> > unrelated to the particular clone).
>>
>> How do you deal with dense history packs? These packs take many hours
>> to make (on a server-class machine) and can be half the size of a
>> regular pack. Shouldn't there be a way to copy these packs intact on
>> an initial clone? It's ok if these packs are specially marked as being
>> ok to copy.
>
> [sigh]
>
> Let me explain it all again.
>
> There are basically two ways to create a new pack: the intelligent way,
> and the bruteforce way.
>
> When creating a new pack the intelligent way, what we do is to enumerate
> all the needed objects and look them up in the object store.  When a
> particular object is found, we create a record for that object and note
> in which pack it is located, at what offset in that pack, how much space
> it occupies in its compressed form within that pack, and whether it
> is a delta or not.  When that object is indeed a delta (the majority of
> objects usually are) then we also keep a pointer on the record for the
> base object for that delta.
>
> Next, for all objects in delta form whose base object is also part of
> the object enumeration and obviously part of the same pack, we simply
> flag those objects as directly reusable without any further processing.
> This means that, when those objects are about to be stored in the new
> pack, their raw data is simply copied straight from the original pack
> using the offset and size noted above.  In other words, those objects
> are simply never redeltified nor redeflated at all, and all the work
> that was previously done to find the best delta match is preserved with
> no extra cost.
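
To make the bookkeeping described above concrete, here is a minimal Python
sketch. It is not git's actual code: the names ObjectRecord, reusable_deltas
and copy_raw are made up for illustration, object ids are plain strings, and
non-delta objects are deliberately left out because the paragraph above only
covers deltas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectRecord:
    """Hypothetical per-object record, mirroring the fields listed above."""
    oid: str                        # object id (a plain string in this sketch)
    pack: str                       # which pack the object was found in
    offset: int                     # byte offset of the object inside that pack
    size: int                       # compressed size inside that pack
    base_oid: Optional[str] = None  # set only when the object is stored as a delta


def reusable_deltas(records: List[ObjectRecord]) -> List[str]:
    """Return ids of deltas whose base is also enumerated and in the same pack."""
    by_oid = {r.oid: r for r in records}
    result = []
    for r in records:
        if r.base_oid is None:
            continue  # not a delta; not modelled in this sketch
        base = by_oid.get(r.base_oid)
        if base is not None and base.pack == r.pack:
            result.append(r.oid)
    return result


def copy_raw(pack_bytes: bytes, record: ObjectRecord) -> bytes:
    """Copy the stored bytes as-is, using only the noted offset and size."""
    return pack_bytes[record.offset:record.offset + record.size]
```

A delta whose base was not enumerated (or lives in another pack) does not get
the flag here and would need some other handling, which this sketch does not
attempt to show.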

Does this process cause random reads all over a 2GB pack file? Busy
servers can't keep a 2GB pack in memory.
Using sendfile() to send the 2GB pack to the client is far more
efficient (assuming the pack is marked as being ok to send).

>
> Of course, when your repository is tightly packed into a single pack,
> then all enumerated objects fall into the reusable category and
> therefore a copy of the original pack is indeed sent over the wire.
> One exception is with older git clients which don't support the delta
> base offset encoding, in which case the delta reference encoding is
> substituted on the fly with almost no cost (this is btw another reason
> why a dumb copy of an existing pack may not work universally either).  But
> in the common case, you might see the above as just the same as if git
> did copy the pack file because it really only reads some data from a
> pack and immediately writes that data out.
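
The on-the-fly substitution mentioned above can be sketched roughly as
follows. This is an illustration in Python, not git's code: the function
names are hypothetical, and the type numbers (6 for offset deltas, 7 for
reference deltas) and the variable-length offset encoding reflect my reading
of the pack format documentation. The point is that only the small base
reference in front of the delta changes; the compressed delta data itself is
copied untouched.

```python
from typing import Tuple

OBJ_OFS_DELTA = 6  # base given as a backwards distance within the same pack
OBJ_REF_DELTA = 7  # base given as the full object id of the base


def encode_ofs_offset(distance: int) -> bytes:
    """Encode a backwards distance in the offset-delta variable-length form."""
    buf = [distance & 0x7F]
    distance >>= 7
    while distance:
        distance -= 1
        buf.append(0x80 | (distance & 0x7F))
        distance >>= 7
    return bytes(reversed(buf))


def decode_ofs_offset(data: bytes) -> int:
    """Inverse of encode_ofs_offset."""
    i = 0
    c = data[i]
    value = c & 0x7F
    while c & 0x80:
        i += 1
        c = data[i]
        value = ((value + 1) << 7) | (c & 0x7F)
    return value


def base_reference(base_id: bytes, distance: int,
                   client_has_ofs_delta: bool) -> Tuple[int, bytes]:
    """Pick the object type and base reference to write for one delta."""
    if client_has_ofs_delta:
        return OBJ_OFS_DELTA, encode_ofs_offset(distance)
    # Older client: fall back to naming the base by its full id.
    return OBJ_REF_DELTA, base_id
```

For an older client the reference grows from a couple of bytes to a full
object id per delta, so the stream is no longer byte-identical to the pack on
disk even though no delta was recomputed.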
>
> The bruteforce repacking is different because it simply doesn't concern
> itself with existing deltas at all.  It instead starts everything from
> scratch and performs the whole delta search all over for all objects.
> This is what takes lots of resources and CPU cycles, and as you may
> guess, is never used for fetch/clone requests.
>
>
> Nicolas
>

-- 
Jon Smirl
jonsmirl@xxxxxxxxx