Re: Git Scaling: What factors most affect Git performance for a large repo?

On Fri, Feb 20, 2015 at 7:09 PM, Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
>>> But actually most of "git fetch" is spent in the reachability check
>>> subsequently done by "git-rev-list" which takes several seconds. I
>>
>> I wonder if reachability bitmap could help here..
>
> I could have sworn I had that enabled already but evidently not. I did
> test it and it cut down on clone times a bit. Now our daily repacking
> is:
>
>         git --git-dir={} gc &&
>         git --git-dir={} pack-refs --all --prune &&
>         git --git-dir={} repack -Ad --window=250 --depth=100 \
>             --write-bitmap-index --pack-kept-objects &&
>
> It's not clear to me from the documentation whether this should just
> be enabled on the server, or the clients too. In any case I've enabled
> it on both.

Pack bitmaps matter most on the server side. What I was not sure about
was whether they help the client side as well, because the client runs
rev-list for the reachability test. But thinking again, I don't think
enabling pack bitmaps on the client side helps much. The "--not --all"
part in rev-list basically just traverses commits, not trees and blobs
(where pack bitmaps shine). The big problem here is "--all", which has
to examine every ref. So it's the large-number-of-refs problem again.
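
To make the "--all" cost concrete, here is a minimal shell sketch, assuming
it is run inside a Git repository. The function names and the tip argument
are hypothetical, and the rev-list call only approximates the check that
fetch performs; it is not the exact command fetch runs.

```shell
# Number of refs that "--all" has to load and mark before the walk starts.
count_refs() {
    git for-each-ref | wc -l | tr -d ' '
}

# Commits reachable from $1 that no existing ref can reach.
# Prints nothing when $1 is already reachable from some ref.
new_commits() {
    git rev-list "$1" --not --all
}
```

Timing new_commits on a repository with many refs (for example with
"time") should show how much of the cost comes from the ref count rather
than from the single new commit.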

> Even then with it enabled on both a "git pull" that pulls down just
> one commit on one branch is 13s. Trace attached at the end of the
> mail.
>
>>> haven't looked into it but there's got to be room for optimization
>>> there, surely it only has to do reachability checks for new refs, or
>>> could run in some "I trust this remote not to send me corrupt data"
>>> completely mode (which would make sense within a company where you can
>>> trust your main Git box).
>>
>> No, it's not just about trusting the server side, it's about catching
>> data corruption on the wire as well. We have a trick to avoid
>> reachability check in clone case, which is much more expensive than a
>> fetch. Maybe we could do something further to help the fetch case _if_
>> reachability bitmaps don't help.
>
> Still, if that's indeed a big bottleneck what's the worst-case
> scenario here? That the local repository gets hosed? The server will
> still recursively validate the objects it gets sent, right?

The server is under pressure to pack and send data fast, so it does not
validate as heavily as the client does. When deltas are reused, only the
crc32 is verified. When deltas are generated, the server must unpack some
objects for deltification, but I don't think it rehashes the content
to check that it still produces the same SHA-1. A single bit flip could
go unnoticed.
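
For readers unsure what "rehashing" means here, a small shell sketch,
assuming it is run inside a Git repository. The function name rehash_check
is hypothetical. It recomputes an object's name from its content and
compares it with the name the object is stored under; it is only an
illustration of the comparison, not what the server or client runs.

```shell
# Recompute the object name from the content and compare it with the
# name the object is stored under. Prints "ok" or "MISMATCH".
rehash_check() {
    obj=$1
    type=$(git cat-file -t "$obj") || return 1
    want=$(git rev-parse --verify "$obj") || return 1
    got=$(git cat-file "$type" "$obj" | git hash-object -t "$type" --stdin) || return 1
    if [ "$want" = "$got" ]; then
        echo ok
    else
        echo MISMATCH
    fi
}
```

For example, "rehash_check HEAD" checks the current commit object.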

> I wonder if a better trade-off in that case would be to skip this in
> some situations and instead put something like "git fsck" in a
> cronjob.

Either that, or be optimistic: accept the pack (i.e. git-fetch returns
quickly) and validate it in the background. If the pack is indeed
good, you don't have to wait until validation is done. If the pack is
bad, you would know after a minute or two, and hopefully you can still
recover from that point.
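
A rough shell sketch of the "validate in the background" half of that
idea, assuming an existing clone with a configured remote. The function
name fetch_then_verify and the log file argument are hypothetical. Note
that git fetch still runs its own connectivity check here; the sketch
does not skip it, it only shows the caller continuing while a full
"git fsck" runs afterwards.

```shell
# Fetch, return to the caller, and check the repository in the background.
# $1: remote name (default "origin"), $2: log file (default "fsck.log").
fetch_then_verify() {
    remote=${1:-origin}
    log=${2:-fsck.log}
    git fetch "$remote" || return 1
    # git fsck exits non-zero when it finds corrupt or missing objects.
    ( git fsck >"$log" 2>&1 || echo "fsck FAILED, see $log" >&2 ) &
}
```

The same "git fsck" call could equally go into a cronjob, as suggested
in the quoted text.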
-- 
Duy