Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos

Jeff King <peff@xxxxxxxx> wrote:

>On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote:
>
>> Both make sense to me.
>> 
>> I also wonder if we would be helped by another "repack" mode that
>> coalesces small packs into a single one with minimum overhead, and
>> run that often from "gc --auto", so that we do not end up having to
>> have 50 packfiles.
>> 
>> When we have 2 or more small and young packs, we could:
>> 
>>  - iterate over idx files for these packs to enumerate the objects
>>    to be packed, replacing read_object_list_from_stdin() step;
>> 
>>  - always choose to copy the data we have in these existing packs,
>>    instead of doing a full prepare_pack(); and
>> 
>>  - use the order the objects appear in the original packs, bypassing
>>    compute_write_order().
>
>I'm not sure. If I understand you correctly, it would basically just
>be concatenating packs without trying to do delta compression between
>the objects which are ending up in the same pack. So it would save us
>from having to do (up to) 50 binary searches to find an object in a
>pack, but would not actually save us much space.
>
>I would be interested to see the timing on how quick it is compared
>to a real repack, as the I/O that happens during a repack is
>non-trivial (although if you are leaving aside the big "main" pack,
>then it is probably not bad).
>
>But how do these somewhat mediocre concatenated packs get turned into
>real packs? Pack-objects does not consider deltas between objects in
>the same pack. And when would you decide to make a real pack? How do
>you know you have 50 young and small packs, and not 50 mediocre
>coalesced packs?


If we are reconsidering repacking strategies, I would like to propose an approach that might be a more general improvement, one that helps in more situations.

You could roll together any packs which are close in size, say within 50% of each other.  With this strategy you end up with pack files whose sizes are spread out exponentially.  I implemented this strategy on top of the current gc script using keep files, and it works fairly well:

https://gerrit-review.googlesource.com/#/c/35215/3/contrib/git-exproll.sh

This saves some time, but mostly it saves I/O when repacking regularly.  I suspect that if this strategy were used in core git, further optimizations could be made to also reduce the repack time, but I don't know enough about repacking to be sure.  We run it nightly on our servers, both the write servers and the read-only mirrors.  We currently use a ratio of 5 to drastically reduce how often the large pack files get rolled over.
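
To illustrate the grouping idea, here is a rough sketch in Python.  This is not the actual logic of git-exproll.sh; the pack directory path, the 1.5 ratio, and the dry-run behaviour are just placeholders for the sake of the example:

#!/usr/bin/env python3
# Rough sketch only: group pack files whose sizes are within a chosen
# ratio of each other and report which groups would be rolled into one.
# The pack directory, the ratio, and the "report only" behaviour are
# illustrative assumptions, not what git-exproll.sh actually does.

import glob
import os

RATIO = 1.5  # packs within 50% of each other's size get rolled together

def pack_groups(pack_dir=".git/objects/pack", ratio=RATIO):
    # Sort packs by size, then start a new group whenever the next pack
    # is more than `ratio` times the smallest pack in the current group.
    packs = sorted((os.path.getsize(p), p)
                   for p in glob.glob(os.path.join(pack_dir, "*.pack")))
    groups, current = [], []
    for size, path in packs:
        if current and size > current[0][0] * ratio:
            groups.append(current)
            current = []
        current.append((size, path))
    if current:
        groups.append(current)
    return groups

if __name__ == "__main__":
    for group in pack_groups():
        if len(group) > 1:
            print("would roll %d packs into one:" % len(group))
            for size, path in group:
                print("  %10d  %s" % (size, path))

A repack then only ever touches packs of roughly the same size, which is where the I/O savings come from: the big "main" pack is only rewritten once enough similarly sized packs have accumulated next to it.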

-Martin


