Git performance results on a large repository

Hi Git folks,

We (Facebook) have been investigating source control systems to meet our
growing needs.  We already use git fairly widely, but have noticed it
getting slower as we grow, and we want to make sure we have a good story
going forward.  We're debating how to proceed and would like to solicit
people's thoughts.

To better understand git scalability, I've built up a large, synthetic
repository and measured a few git operations on it.  I summarize the
results here.

The test repo has 4 million commits, linear history and about 1.3 million
files.  The size of the .git directory is about 15GB, and it has been
repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
--window=250'.  This repack took about 2 days on a beefy machine (i.e.,
lots of RAM and flash).  The size of the index file is 191 MB.  I can
share the script that generated the repo if people are interested; it
basically picks 2-5 files, modifies a line or two, adds a few lines at the
end consisting of random dictionary words, occasionally creates a new
file, commits all the modifications, and repeats.
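
For reference, here is a minimal sketch of what such a generator might
look like.  This is my illustration, not the actual script; the dictionary
path, the new-file frequency, and the per-commit edit sizes are all
assumptions:

    #!/bin/bash
    # Hypothetical synthetic-history generator (a sketch, not the real
    # script).  Assumes an existing git repo in $PWD and a word list at
    # /usr/share/dict/words.
    set -e
    for i in $(seq 1 4000000); do
        # Pick 2-5 existing tracked files at random.
        for f in $(git ls-files | shuf -n $((RANDOM % 4 + 2))); do
            sed -i '1s/$/ x/' "$f"                     # tweak a line
            shuf -n 5 /usr/share/dict/words >> "$f"    # append random words
        done
        # Occasionally create a new file.
        if [ $((RANDOM % 10)) -eq 0 ]; then
            shuf -n 20 /usr/share/dict/words > "file_$i.txt"
        fi
        git add -A
        git commit -q -m "synthetic commit $i"
    done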

I timed a few common operations with both a warm OS file cache and a cold
cache.  That is, I did an 'echo 3 | tee /proc/sys/vm/drop_caches' and then
ran the operation in question a few times (the first timing is the cold
timing; the next few are the warm timings).  The following results are on
a server with an average hard drive (i.e., not flash) and > 10GB of RAM.
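
Concretely, each measurement looked roughly like this, using 'git status'
as the example (the 'sync' beforehand is my addition; the kernel
documentation recommends it before dropping caches):

    # Flush dirty pages, then drop the page cache, dentries and inodes.
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches

    time git status    # first run: cold-cache timing
    time git status    # later runs: warm-cache timings
    time git status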

'git status' :   39 minutes cold, and 24 seconds warm.

'git blame':   44 minutes cold, 11 minutes warm.

'git add' (appending a few chars to the end of a file and adding it):   7
seconds cold and 5 seconds warm.

'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
--no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
of git to remove the three or four places where 'git commit' stats every
file in the repo, and this dropped the times to 30 minutes cold and 8
seconds warm.
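
Incidentally, an easy way to see the per-file stat traffic for yourself
(my suggestion, not part of the measurements above) is to count the
syscalls a command makes:

    # Write a syscall-count summary to a file, then look at the
    # stat/lstat/fstat rows for the per-file traffic.
    strace -c -o stat-summary.txt git status > /dev/null
    grep stat stat-summary.txt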


The git performance we observed here is too slow for our needs.  So the
question becomes: if we want to keep using git going forward, what's the
best way to improve performance?  It seems clear we'll probably need some
specialized servers (e.g., to perform git-blame quickly) and maybe
specialized file-system integration to detect which files have changed in
a working tree.

One way to get there is to make some deep modifications to git internals:
for example, creating abstractions and interfaces that allow plugging in
the specialized servers.  Another way is to leave git internals as they
are and develop a layer of wrapper scripts around all the git commands
that do the necessary interfacing; a sketch of what such a wrapper might
look like follows.  The wrapper scripts seem easier in the short term, but
may lead to increasing divergence from how git behaves natively, and they
add a layer of complexity.
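
As a strawman, a wrapper could look something like the sketch below.  The
'changed-files' helper, which would query some file-system-watching daemon
for paths modified since it was last asked, is entirely hypothetical:

    #!/bin/bash
    # Hypothetical wrapper installed ahead of git in $PATH.  It
    # intercepts 'status' and narrows it to the paths a (hypothetical)
    # file-watching daemon reports as changed; everything else passes
    # through to the real git.
    REAL_GIT=/usr/bin/git
    if [ "$1" = "status" ]; then
        # If the daemon reports nothing, xargs -r runs nothing, which
        # matches a clean working tree.
        changed-files | xargs -r "$REAL_GIT" status --
    else
        exec "$REAL_GIT" "$@"
    fi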

Thoughts?

Cheers,
Josh
