Re: Git performance results on a large repository

Hi Ævar,


Thanks for the comments.  I've included a bunch more info on the test repo
below.  It is based on a growth model of two of our current repositories
(i.e., it's not a Perforce import).  We already have some of the easily
separable projects in separate repositories, like HPHP.  If we could
split our largest repos into multiple ones, that would help the scaling
issue.  However, the code in those repos is rather interdependent, and we
believe splitting it up would hurt more than help, at least for the
medium-term future.  We derive a fair amount of benefit from the code
sharing and from keeping things together in a single repo, so it's not
clear when it would make sense to get more aggressive about splitting
things up.

Some more information on the test repository:  the working directory is
9.5 GB, and the median file size is 2 KB.  The average depth of a
directory (counting the number of '/'s) is 3.6 levels and the average
depth of a file is 4.6.  More detailed histograms of the repository
composition are below:

------------------------

Histogram of depth of every directory in the repo (dirs=`find . -type d` ;
(for dir in $dirs; do t=${dir//[^\/]/}; echo ${#t} ; done) |
~/tmp/histo.py)
* The .git directory itself has only 161 files, so although it is
included, it doesn't affect the numbers significantly.

[0.0 - 1.3): 271
[1.3 - 2.6): 9966
[2.6 - 3.9): 56595
[3.9 - 5.2): 230239
[5.2 - 6.5): 67394
[6.5 - 7.8): 22868
[7.8 - 9.1): 6568
[9.1 - 10.4): 420
[10.4 - 11.7): 45
[11.7 - 13.0]: 21
n=394387 mean=4.671830, median=5.000000, stddev=1.272658
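The ~/tmp/histo.py helper used in these pipelines wasn't posted; a minimal sketch of what such a script might look like (this is a hypothetical reconstruction, with bin count and output formatting chosen to match the histograms shown above, not the actual script):

```python
# Hypothetical reconstruction of the ~/tmp/histo.py helper: takes a
# list of numbers and renders a 10-bin histogram plus summary stats in
# the same format as the output above. Not the original script.
import statistics

def histo(values, bins=10):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # guard against a zero-width bin
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp max into last bin
        counts[i] += 1
    lines = []
    for i, c in enumerate(counts):
        left, right = lo + i * width, lo + (i + 1) * width
        bracket = "]" if i == bins - 1 else ")"  # last bin is closed
        lines.append(f"[{left:.1f} - {right:.1f}{bracket}: {c}")
    lines.append(
        f"n={len(values)} mean={statistics.mean(values):f}, "
        f"median={statistics.median(values):f}, "
        f"stddev={statistics.pstdev(values):f}"
    )
    return lines
```

Hooking it up to stdin (`vals = [float(l) for l in sys.stdin if l.strip()]`) reproduces the pipelines above.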


Histogram of depth of every file in the repo (files=`git ls-files` ; (for
file in $files; do t=${file//[^\/]/}; echo ${#t} ; done) | ~/tmp/histo.py)
* 'git ls-files' does not prefix entries with './' the way the 'find'
command above does, which is why the average appears to be the same as
in the directory stats

[0.0 - 1.3]: 1274
[1.3 - 2.6]: 35353
[2.6 - 3.9]: 196747
[3.9 - 5.2]: 786647
[5.2 - 6.5]: 225913
[6.5 - 7.8]: 77667
[7.8 - 9.1]: 22130
[9.1 - 10.4]: 1599
[10.4 - 11.7]: 164
[11.7 - 13.0]: 118
n=1347612 mean=4.655750, median=5.000000, stddev=1.278399


Histogram of file sizes (for first 50k files - this command takes a
while):  files=`git ls-files` ; (for file in $files; do stat -c%s $file ;
done) | ~/tmp/histo.py

[ 0.0 - 4.7): 0
[ 4.7 - 22.5): 2
[ 22.5 - 106.8): 0
[ 106.8 - 506.8): 0
[ 506.8 - 2404.7): 31142
[ 2404.7 - 11409.9): 17837
[ 11409.9 - 54137.1): 942
[ 54137.1 - 256866.9): 53
[ 256866.9 - 1218769.7): 18
[ 1218769.7 - 5782760.0]: 5
n=49999 mean=3590.953239, median=1772.000000, stddev=42835.330259
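Forking one stat(1) process per file is what makes that last command so slow; a single-process scan avoids it. A sketch (note this walks the working tree directly rather than the 'git ls-files' output, so untracked files would be counted too):

```python
# Sketch of a faster file-size scan: one os.lstat call per file in a
# single process, instead of forking stat(1) per file as in the shell
# loop above. Prunes .git so only working-tree files are measured.
import os

def file_sizes(root):
    """Yield the size in bytes of every file under root, skipping .git."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # don't descend into .git
        for name in filenames:
            try:
                yield os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished mid-walk; ignore it
```

Printing one size per line (`for s in file_sizes("."): print(s)`) feeds the same histo.py pipeline as before.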

Cheers,
Josh






On 2/3/12 9:56 AM, "Ævar Arnfjörð Bjarmason" <avarab@xxxxxxxxx> wrote:

>On Fri, Feb 3, 2012 at 15:20, Joshua Redstone <joshua.redstone@xxxxxx>
>wrote:
>
>> We (Facebook) have been investigating source control systems to meet our
>> growing needs.  We already use git fairly widely, but have noticed it
>> getting slower as we grow, and we want to make sure we have a good story
>> going forward.  We're debating how to proceed and would like to solicit
>> people's thoughts.
>
>Where I work we also have a relatively large Git repository. Around
>30k files, a couple of hundred thousand commits, clone size around
>half a GB.
>
>You haven't supplied background info on this but it really seems to me
>like your testcase is converting something like a humongous Perforce
>repository directly to Git.
>
>While you /can/ do this it's not a good idea; you should split up
>repositories at the boundaries where code or data doesn't directly
>cross over, e.g. there's no reason why you need HipHop PHP in the same
>repository as Cassandra or the Facebook chat system, is there?
>
>While Git could do better with large repositories (in particular,
>applying commits in interactive rebase seems to slow down on bigger
>repositories) there's only so much you can do about stat-ing 1.3
>million files.
>
>A structure that would make more sense would be to split up that giant
>repository into a lot of other repositories, most of which probably
>have no direct dependencies on other components; even those that do
>can sometimes just use another repository as a submodule.
>
>Even if you have the requirement that you'd like to roll out
>*everything* at a certain point in time you can still solve that with
>a super-repository that has all the other ones as submodules, and
>creates a tag for every rollout or something like that.

