Re: Git performance results on a large repository

Joshua Redstone <joshua.redstone@xxxxxx> writes:

> The test repo has 4 million commits, linear history and about 1.3 million
> files.  The size of the .git directory is about 15GB, and has been
> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
> --window=250'.  This repack took about 2 days on a beefy machine (i.e.,
> lots of ram and flash).  The size of the index file is 191 MB. I can share
> the script that generated it if people are interested -- it basically picks
> 2-5 files, modifies a line or two and adds a few lines at the end
> consisting of random dictionary words, occasionally creates a new file,
> commits all the modifications and repeats.
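
For anyone who wants to reproduce something similar before the script
is posted, a rough, untested sketch of such a generator; the files/
layout, the word-list path, and the counts are guesses on my part:

    #!/bin/bash
    # Synthesize a long linear history: modify a few files, append random
    # dictionary words, and commit.  New-file creation is omitted here.
    for i in $(seq 1 "$NUM_COMMITS"); do
        for f in $(ls files/ | shuf -n $(( RANDOM % 4 + 2 ))); do   # pick 2-5 files
            sed -i "$(( RANDOM % 10 + 1 ))s/\$/ tweak/" "files/$f"  # modify one line
            shuf -n 3 /usr/share/dict/words >> "files/$f"           # append a few words
        done
        git add -A
        git commit -q -m "synthetic commit $i"
    done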

I have a repository with about 500K files, a 3.3 GB checkout, a 1.5 GB
.git, and about 10K commits.  (This is a real repository, not a test
case.)  So far fewer commits, but the size seems not so far off.

> I timed a few common operations with both a warm OS file cache and a cold
> cache.  i.e., I did an 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
> the operation in question a few times (first timing is the cold timing,
> the next few are the warm timings).  The following results are on a server
> with an average hard drive (i.e., not flash) and >10 GB of RAM.
>
> 'git status' :   39 minutes cold, and 24 seconds warm.

Both of these numbers surprise me.  I'm using NetBSD, whose stat
implementation isn't as optimized as Linux's (you didn't say, but I'm
assuming you're on Linux).  On a years-old desktop, git status takes
about a minute semi-cold and 5s warm (once I raised the vnode cache to
over 500K entries, versus the 350K default for a 2 GB RAM machine).
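
For reference, that knob on NetBSD is kern.maxvnodes; the right value
depends on how many files a status walk has to visit:

    sysctl kern.maxvnodes               # show the current limit
    sysctl -w kern.maxvnodes=600000     # raise it above the checkout's file count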

So on the warm status, I wonder how big your vnode cache is and whether
you've exceeded it; the cold time I don't follow at all.  Probably some
sort of profiling within git status would be illuminating.
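
Something along these lines might be a starting point (GIT_TRACE is
git's own tracing; perf assumes you're on Linux):

    GIT_TRACE=1 git status                    # trace git's internal activity
    perf record -g git status && perf report  # CPU profile of the status run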

> 'git blame':   44 minutes cold, 11 minutes warm.
>
> 'git add' (appending a few chars to the end of a file and adding it):   7
> seconds cold and 5 seconds warm.
>
> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
> --no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
> of git to remove the three or four places where 'git commit' stats every
> file in the repo, and this dropped the times to 30 minutes cold and 8
> seconds warm.

So without the stat calls, I wonder what it's doing that takes 30 minutes.
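
A syscall summary would at least show whether stat(), reads, or
something else dominates.  On Linux, something like:

    # -c prints per-syscall counts and total times; -f follows children
    strace -c -f git commit -m "foo" --no-verify --untracked-files=no --quiet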

> One way to get there is to do some deep code modifications to git
> internals, to, for example, create some abstractions and interfaces that
> allow plugging in the specialized servers.  Another way is to leave git
> internals as they are and develop a layer of wrapper scripts around all
> the git commands that do the necessary interfacing.  The wrapper scripts
> seem perhaps easier in the short-term, but may lead to increasing
> divergence from how git behaves natively and also a layer of complexity.
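
To make the wrapper idea concrete, here is a minimal sketch of the
sort of thing I'd imagine; 'blame-cache' is a hypothetical client for
such a specialized server, not an existing tool:

    #!/bin/sh
    # hypothetical wrapper installed ahead of git in $PATH
    if [ "$1" = "blame" ]; then
        shift
        # try the (imagined) cache server first; fall through on a miss
        blame-cache lookup "$@" && exit 0
    fi
    exec /usr/bin/git "$@"    # everything else goes to the real git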

Having hooks for a blame-server cache, etc. sounds sensible.  Having a
way to run blame limited to recent history, sort of like with --since,
and then keep extending it backward to earlier times (e.g. in emacs),
sounds useful.
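
Blame already takes revision limits, so the starting point exists;
lines older than the limit are just attributed to boundary commits
(marked with '^'):

    git blame --since=3.weeks -- foo.c    # blame only recent history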
