Re: [PATCH] git exproll: steps to tackle gc aggression

Martin Fick <mfick@xxxxxxxxxxxxxx> · Tue, 6 Aug 2013 18:10:46 -0600

On Tuesday, August 06, 2013 06:24:50 am Duy Nguyen wrote:
> On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra <artagnon@xxxxxxxxx> wrote:
> > +               Garbage collect using a pseudo logarithmic packfile maintenance
> > +               approach.  This approach attempts to minimize packfile churn
> > +               by keeping several generations of varying sized packfiles around
> > +               and only consolidating packfiles (or loose objects) which are
> > +               either new packfiles, or packfiles close to the same size as
> > +               another packfile.
> 
> I wonder if a simpler approach may be nearly as efficient
> as this one: keep the largest pack out, repack the rest at
> fetch/push time so there are at most 2 packs at a time.
> Or we could do the repack at 'gc --auto' time, but
> with lower pack threshold (about 10 or so). When the
> second pack is as big as, say half the size of the
> first, merge them into one at "gc --auto" time. This can
> be easily implemented in git-repack.sh.

It would definitely be better than the current gc approach.  

However, I suspect it is still at least one to two orders of 
magnitude off from where it should be.  To give you a 
real-world example: on our server today, when gitexproll ran 
on our kernel/msm repo, it consolidated 317 packfiles into a 
single packfile of almost 8M (it compresses/dedupes 
shockingly well; one of those new packs alone was 33M).  Our 
largest packfile in that repo is 1.5G!

So let's now imagine that the second largest packfile is 
only 100M.  It would keep getting consolidated with 8M worth 
of data every day (assuming the same conditions and no extra 
compression).  It would take (750M-100M)/8M ~ 81 days to 
build up to 750M (half the size of the 1.5G pack), large 
enough to no longer consolidate the new packs with the 
second largest packfile daily.  During those 80+ days, it 
will be on average writing 325M too much per day (when it 
should be writing just 8M).
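
To make that arithmetic easy to replay with other numbers, 
here is a rough sketch of the single-layer scheme as I read 
it.  The function name and the model itself are my own 
assumptions, not anything from the patch or git-repack.sh: 
each day the 8M of new data is repacked together with the 
second pack, until that pack reaches half the size of the 
largest one.  Because this toy model also counts the initial 
100M being rewritten every day, its average excess comes out 
somewhat higher than the 325M estimated above, but it lands 
in the same ballpark.

```python
def simulate_single_layer(largest_mb=1500, second_mb=100,
                          daily_mb=8, merge_ratio=0.5):
    """Hypothetical model of a 'largest pack + one rolling pack' policy.

    Returns (days, avg_excess_mb): how many daily repacks happen before
    the second pack reaches merge_ratio * largest_mb, and how much data
    beyond the new daily_mb is rewritten per day on average.
    """
    threshold_mb = largest_mb * merge_ratio
    days = 0
    total_written_mb = 0
    while second_mb < threshold_mb:
        # Assumption: the repack rewrites the whole second pack plus
        # the new data, with no extra compression gained.
        second_mb += daily_mb
        total_written_mb += second_mb
        days += 1
    if days == 0:
        return 0, 0.0
    avg_excess_mb = total_written_mb / days - daily_mb
    return days, avg_excess_mb


if __name__ == "__main__":
    days, excess = simulate_single_layer()
    print(f"{days} days, ~{excess:.0f}M rewritten per day "
          f"beyond the new data")
```

Changing second_mb or merge_ratio shows how sensitive the 
wasted I/O is to where the second pack starts out.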

So I can see the appeal of a simple solution, but 
unfortunately I think one layer would still "suck".  And if 
you are going to add even just one extra layer, I suspect 
that you might as well go the full distance, since you 
probably already need to implement the logic to do so?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation