Hi Git folks,

We (Facebook) have been investigating source control systems to meet our growing needs. We already use git fairly widely, but have noticed it getting slower as we grow, and we want to make sure we have a good story going forward. We're debating how to proceed and would like to solicit people's thoughts.

To better understand git scalability, I've built up a large, synthetic repository and measured a few git operations on it. I summarize the results here.

The test repo has 4 million commits, linear history and about 1.3 million files. The .git directory is about 15 GB and has been repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100 --window=250'. The repack took about 2 days on a beefy machine (i.e., lots of RAM and flash). The index file is 191 MB. I can share the script that generated the repo if people are interested; it basically picks 2-5 files, modifies a line or two, adds a few lines at the end consisting of random dictionary words, occasionally creates a new file, commits all the modifications, and repeats (a rough sketch of the loop is appended at the end of this mail).

I timed a few common operations with both a warm OS file cache and a cold cache. That is, I ran 'echo 3 | tee /proc/sys/vm/drop_caches' and then did the operation in question a few times; the first timing is the cold timing, the next few are the warm timings. The following results are on a server with an average hard drive (i.e., not flash) and more than 10 GB of RAM.

'git status': 39 minutes cold, 24 seconds warm.
'git blame': 44 minutes cold, 11 minutes warm.
'git add' (appending a few chars to the end of a file and adding it): 7 seconds cold, 5 seconds warm.
'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet --no-status': 41 minutes cold, 20 seconds warm.

I also hacked a version of git to remove the three or four places where 'git commit' stats every file in the repo; that dropped the commit times to 30 minutes cold and 8 seconds warm.

The git performance we observed here is too slow for our needs. So the question becomes: if we want to keep using git going forward, what's the best way to improve performance? It seems clear we'll need some specialized servers (e.g., to perform git-blame quickly) and maybe specialized file system integration to detect which files have changed in a working tree.

One way to get there is to make deep modifications to git internals, for example to create abstractions and interfaces that allow plugging in the specialized servers. Another way is to leave git internals as they are and develop a layer of wrapper scripts around all the git commands that do the necessary interfacing (a toy sketch of that idea is also appended below). The wrapper scripts seem easier in the short term, but they risk diverging further and further from native git behavior and add a layer of complexity.

Thoughts?

Cheers,
Josh
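
P.S. A couple of rough sketches to make the above concrete.

The generator loop looked roughly like this. This is a simplified sketch, not the exact script (happy to send that separately); it assumes a repo already seeded with an initial tree and a word list at /usr/share/dict/words:

    #!/bin/bash
    # Simplified sketch of the synthetic-repo generator.
    set -e
    WORDS=/usr/share/dict/words

    for ((i = 0; i < 4000000; i++)); do
        # Pick 2-5 tracked files at random and touch each one.
        git ls-files | shuf -n $((RANDOM % 4 + 2)) | while read -r f; do
            # Modify an existing line ...
            sed -i "1s/\$/ $(shuf -n 1 "$WORDS")/" "$f"
            # ... and append a few random dictionary words.
            shuf -n 4 "$WORDS" | paste -sd ' ' >> "$f"
        done
        # Occasionally create a brand-new file.
        if ((RANDOM % 50 == 0)); then
            shuf -n 20 "$WORDS" > "new-file-$i.txt"
        fi
        git add -A
        git commit -q -m "synthetic change $i"
    done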
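
The cold/warm numbers were gathered along these lines (run as root, since drop_caches needs it):

    # Cold run: flush the page cache, then time the command.
    echo 3 | tee /proc/sys/vm/drop_caches
    time git status

    # Warm runs: repeat the same command a few times without flushing.
    time git status
    time git status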
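
And to illustrate the wrapper-script idea, a toy front end for 'git'. It assumes an imaginary file-monitor daemon, queried here via a made-up 'changed-paths' command, that reports which working-tree paths changed since it was last asked; everything other than 'status' falls through to real git:

    #!/bin/bash
    # Toy 'git' wrapper illustrating the approach.  'changed-paths' is a
    # placeholder for a hypothetical file-monitor client; it does not
    # exist today.
    REAL_GIT=/usr/bin/git

    case "$1" in
        status)
            shift
            # Ask the monitor which paths changed and limit 'status' to
            # those, instead of letting git stat the entire working tree.
            changed-paths --since-last-query |
                xargs -r -d '\n' "$REAL_GIT" status --porcelain "$@" --
            ;;
        *)
            exec "$REAL_GIT" "$@"
            ;;
    esac

Something similar could wrap 'commit' and 'add' to avoid the full-tree stat calls. The catch, as mentioned above, is keeping such wrappers from drifting away from real git semantics.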