This is fantastic. I really appreciate all the answers, and it's great that I seem to have sparked some general discussion that could lead somewhere too.

Notes:
* I'm currently using 2.1.3. I'll move to 2.3.x.
* I'm experimenting with git-annex to reduce repo size on disk. We'll see.
* I could remove all tags older than /n/ years in the active repo and just maintain them in the historical repo. (We have quite a lot of CI-generated tags.) It sounds like that might improve performance. (A rough command sketch for this is at the end of this message.)

Questions:

1. Ævar: I'm a bit concerned by your statement that git rebases take about 1-2 s per commit. Does that mean that a "git pull --rebase" that picks up, say, 120 commits (not at all unrealistic) could potentially take 4 minutes to complete? Or have I misinterpreted your comment?

2. I'd not heard about bitmap indexes before this thread, but it sounds like they should help me. In limited searching I can't find much useful documentation about them. It is also not clear to me whether I have to explicitly run "git repack --write-bitmap-index" or whether git will automatically detect when bitmaps are needed; first experiments seem to indicate that I need to generate them explicitly. I assume that once the index is there, git will just use it automatically. (See the second sketch at the end of this message.)

Steve


On Thu, Feb 19, 2015 at 7:03 PM, brian m. carlson
<sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Thu, Feb 19, 2015 at 04:26:58PM -0500, Stephen Morton wrote:
>> I posted this to comp.version-control.git.user and didn't get any response. I think the question is plumbing-related enough that I can ask it here.
>>
>> I'm evaluating the feasibility of moving my team from SVN to git. We have a very large repo. [1] We will have a central repo using GitLab (or similar) that everybody works with. Forks, code sharing, pull requests etc. will be done through this central server.
>>
>> By 'performance', I guess I mean speed of day-to-day operations for devs.
>>
>> * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>> * Will a few simultaneous clones from the central server also slow down other concurrent operations for other users?
>
> This hasn't been a problem for us at $DAYJOB. Git doesn't lock anything on fetches, so each process is independent. We probably have about sixty developers (and maybe twenty other occasional users) that manage to interact with our Git server all day long. We also have probably twenty smoker (CI) systems pulling at two-hour intervals, or, when there's nothing to do, every two minutes, plus probably fifteen to twenty build systems pulling hourly.
>
> I assume you will provide adequate resources for your server.
>
>> * Will 'git pull' be slow?
>> * 'git push'?
>
> The most pathological case I've seen for git push is a branch with a single commit merged into the main development branch. As of Git 2.3.0, the performance regression here is fixed.
>
> Obviously, the speed of your network connection will affect this. Even at 30 MB/s, cloning several gigabytes of data takes time. Git tries hard to avoid sending a lot of data, so if your developers keep reasonably up-to-date, the cost of establishing the connection will tend to dominate.
>
> I see pull and push times that are less than 2 seconds in most cases.
>
>> * 'git commit'? (It is listed as slow in reference [3].)
>> * 'git status'? (Slow again in reference 3 though I don't see it.)
>
> These can be slow with slow disks or over remote file systems. I recommend not doing that.
> I've heard rumbles that disk performance is better on Unix, but I don't use Windows so I can't say.
>
> You should keep your .gitignore files up-to-date to avoid enumerating untracked files. There's some work towards making this less of an issue.
>
> git blame can be somewhat slow, but it's not something I use more than about once a day, so it doesn't bother me that much.
>
>> Assuming I can put lots of resources into a central server with lots of CPU, RAM, fast SSD, fast networking, what aspects of the repo are most likely to affect devs' experience?
>> * Number of commits
>> * Sheer disk space occupied by the repo
>
> The number of files can impact performance due to the number of stat()s required.
>
>> * Number of tags.
>> * Number of branches.
>
> The number of tags and branches individually is really less relevant than the total number of refs (tags, branches, remote branches, etc). Very large numbers of refs can impact performance on pushes and pulls due to the need to enumerate them all.
>
>> * Binary objects in the repo that cause it to bloat in size [1]
>> * Other factors?
>
> If you want good performance, I'd recommend the latest version of Git both client- and server-side. Newer versions of Git provide pack bitmaps, which can dramatically speed up clones and fetches, and Git 2.3.0 fixes a performance regression with large numbers of refs in non-shallow repositories.
>
> It is totally worth it to roll your own packages of git if your vendor provides old versions.
>
>> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD, networking-- which is most critical here?
>
> I generally find that having a good disk cache is important with large repositories. It may be advantageous to make sure the developer machines have adequate memory. Performance is notably better on development machines (VMs) with 2 GB or 4 GB of memory instead of 1 GB.
>
> I can't speak to the server side, as I'm not directly involved with its deployment.
>
>> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo, 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up branches" which are just one little dangling commit required to change the code a little bit between a commit and a tag that was not quite made from it.)
>
> I routinely work on a repo that's 1.9 GB packed, with 25k (and rapidly growing) refs. Other developers work on a repo that's 9 GB packed, with somewhat fewer refs. We don't tend to have problems with this.
>
> Obviously, performance is better on some of our smaller repos, but it's not unacceptable on the larger ones. I generally find that the 940 KB repo with huge numbers of files performs worse than the 1.9 GB repo with somewhat fewer. If you can split your repository into multiple logical repositories, that will certainly improve performance.
>
> If you end up having pain points, we're certainly interested in working through those. I've brought up performance problems and people are generally responsive.
> --
> brian m. carlson / brian with sandals: Houston, Texas, US
> +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
> OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187
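
P.S. For the tag cleanup mentioned in my notes above, this is only a sketch of what I have in mind: "historical" is a placeholder for a remote pointing at the archive repo, and "old-ci-tag" stands in for one of our CI-generated tag names.

    # List tags with their creation dates to help pick a cutoff
    git for-each-ref --sort=creatordate --format='%(creatordate) %(refname:short)' refs/tags

    # Make sure the archive repo has every tag before deleting anything
    git push historical --tags

    # Then drop an old tag from the active repo, locally and on the central server
    git tag -d old-ci-tag
    git push origin :refs/tags/old-ci-tag
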
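And for question 2: as far as I can tell, the bitmap has to be generated explicitly (or enabled via config) on the repo that serves the clones and fetches; once the .bitmap file exists, git picks it up automatically. A minimal sketch, assuming the server-side repo is bare and runs git 2.1 or newer (so the repack.writeBitmaps config key is available):

    # One-off: repack everything into a single pack and write a bitmap next to it
    git repack -a -d --write-bitmap-index

    # Ongoing: have future repacks (and git gc) regenerate the bitmap
    git config repack.writeBitmaps true

    # Verify: a *.bitmap file should now sit beside the pack
    ls objects/pack/*.bitmap     # (.git/objects/pack/ in a non-bare repo)

Nothing needs to change on the client side; the bitmap only speeds up the serving side's object counting during clones and fetches.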