On Thu, Feb 19, 2015 at 04:26:58PM -0500, Stephen Morton wrote:
> I posted this to comp.version-control.git.user and didn't get any response. I
> think the question is plumbing-related enough that I can ask it here.
>
> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
> large repo. [1] We will have a central repo using GitLab (or similar) that
> everybody works with. Forks, code sharing, pull requests etc. will be done
> through this central server.
>
> By 'performance', I guess I mean speed of day to day operations for devs.
>
> * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
> * Will a few simultaneous clones from the central server also slow down
>   other concurrent operations for other users?

This hasn't been a problem for us at $DAYJOB. Git doesn't lock anything
on fetches, so each process is independent. We probably have about
sixty developers (and maybe twenty other occasional users) who manage
to interact with our Git server all day long. We also have probably
twenty smoker (CI) systems pulling at two-hour intervals, or, when
there's nothing to do, every two minutes, plus probably fifteen to
twenty build systems pulling hourly.

I assume you will provide adequate resources for your server.

> * Will 'git pull' be slow?
> * 'git push'?

The most pathological case I've seen for git push is a branch with a
single commit merged into the main development branch. As of Git 2.3.0,
the performance regression here is fixed.

Obviously, the speed of your network connection will affect this. Even
at 30 MB/s, cloning several gigabytes of data takes time. Git tries
hard to avoid sending data the other side already has, so if your
developers keep reasonably up-to-date, the cost of establishing the
connection will tend to dominate. I see pull and push times that are
less than 2 seconds in most cases.

> * 'git commit'? (It is listed as slow in reference [3].)
> * 'git status'? (Slow again in reference [3] though I don't see it.)

These can be slow with slow disks or over remote file systems. I
recommend not doing that. I've heard rumbles that disk performance is
better on Unix than on Windows, but I don't use Windows, so I can't
say.

You should keep your .gitignore files up-to-date to avoid enumerating
untracked files. There's some work towards making this less of an
issue.

git blame can be somewhat slow, but it's not something I use more than
about once a day, so it doesn't bother me that much.

> Assuming I can put lots of resources into a central server with lots of CPU,
> RAM, fast SSD, fast networking, what aspects of the repo are most likely to
> affect devs' experience?
> * Number of commits
> * Sheer disk space occupied by the repo

The number of files can impact performance due to the number of stat()s
required.

> * Number of tags.
> * Number of branches.

The number of tags and branches individually is less relevant than the
total number of refs (tags, branches, remote branches, etc.). Very
large numbers of refs can impact performance on pushes and pulls due to
the need to enumerate them all.

> * Binary objects in the repo that cause it to bloat in size [1]
> * Other factors?

If you want good performance, I'd recommend the latest version of Git
both client- and server-side. Newer versions of Git provide pack
bitmaps, which can dramatically speed up clones and fetches, and Git
2.3.0 fixes a performance regression with large numbers of refs in
non-shallow repositories. It is well worth rolling your own packages of
Git if your vendor only provides old versions.
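As a concrete sketch of the server-side part (assuming a bare
repository you administer yourself and Git 2.0 or newer; the path below
is only an example), enabling bitmaps is a one-time config change plus
a full repack:

    # On the server, inside the bare repository (example path):
    cd /srv/git/project.git
    git config repack.writeBitmaps true      # write bitmaps on future full repacks
    git repack -A -d --write-bitmap-index    # repack into one pack and build the bitmap now

Hosting tools may already run equivalent housekeeping for you, so it's
worth checking before scheduling your own repacks.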
> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
> networking-- which is most critical here?

I generally find that having a good disk cache is important with large
repositories. It may be advantageous to make sure the developer
machines have adequate memory. Performance is notably better on
development machines (VMs) with 2 GB or 4 GB of memory instead of 1 GB.

I can't speak to the server side, as I'm not directly involved with its
deployment.

> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
> 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
> branches" which are just one little dangling commit required to change the code
> a little bit between a commit and a tag that was not quite made from it.)

I routinely work on a repo that's 1.9 GB packed, with 25k (and rapidly
growing) refs. Other developers work on a repo that's 9 GB packed, with
somewhat fewer refs. We don't tend to have problems with this.

Obviously, performance is better on some of our smaller repos, but it's
not unacceptable on the larger ones. I generally find that the 940 KB
repo with huge numbers of files performs worse than the 1.9 GB repo
with somewhat fewer files.

If you can split your repository into multiple logical repositories,
that will certainly improve performance.

If you end up having pain points, we're certainly interested in working
through those. I've brought up performance problems before, and people
are generally responsive.
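If you want a rough sense of where your own repository falls on the
axes discussed above (packed size, total refs, number of tracked
files), a few stock Git commands will tell you:

    # Run inside a clone of the repository in question:
    git count-objects -v -H    # packed size and object counts
    git for-each-ref | wc -l   # total refs: branches, tags, remote branches, etc.
    git ls-files | wc -l       # tracked files, which is what drives the stat() cost

That gives you concrete numbers to compare against the figures above.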
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187