On Thu, Feb 19, 2015 at 04:26:58PM -0500, Stephen Morton wrote:
> I posted this to comp.version-control.git.user and didn't get any response. I
> think the question is plumbing-related enough that I can ask it here.
>
> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
> large repo. [1] We will have a central repo using GitLab (or similar) that
> everybody works with. Forks, code sharing, pull requests etc. will be done
> through this central server.
>
> By 'performance', I guess I mean speed of day to day operations for devs.
>
> * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
> * Will a few simultaneous clones from the central server also slow down
>   other concurrent operations for other users?

This hasn't been a problem for us at $DAYJOB. Git doesn't lock anything
on fetches, so each process is independent. We probably have about
sixty developers (and maybe twenty other occasional users) who manage
to interact with our Git server all day long. We also have probably
twenty smoker (CI) systems pulling at two-hour intervals, or, when
there's nothing to do, every two minutes, plus probably fifteen to
twenty build systems pulling hourly.

I assume you will provide adequate resources for your server.

> * Will 'git pull' be slow?
> * 'git push'?

The most pathological case I've seen for git push is a branch with a
single commit merged into the main development branch. As of Git 2.3.0,
the performance regression here is fixed.

Obviously, the speed of your network connection will affect this. Even
at 30 MB/s, cloning several gigabytes of data takes time. Git tries
hard to avoid sending data the other side already has, so if your
developers keep reasonably up-to-date, the cost of establishing the
connection will tend to dominate. I see pull and push times that are
less than 2 seconds in most cases.

> * 'git commit'? (It is listed as slow in reference [3].)
> * 'git status'? (Slow again in reference [3] though I don't see it.)

These can be slow with slow disks or over remote file systems. I
recommend not doing that. I've heard rumbles that disk performance is
better on Unix than on Windows, but I don't use Windows, so I can't
say.

You should keep your .gitignore files up-to-date to avoid enumerating
untracked files. There's some work towards making this less of an
issue.

git blame can be somewhat slow, but it's not something I use more than
about once a day, so it doesn't bother me that much.

> Assuming I can put lots of resources into a central server with lots of CPU,
> RAM, fast SSD, fast networking, what aspects of the repo are most likely to
> affect devs' experience?
> * Number of commits
> * Sheer disk space occupied by the repo

The number of files can impact performance due to the number of stat()s
required.

> * Number of tags.
> * Number of branches.

The number of tags and branches individually is less relevant than the
total number of refs (tags, branches, remote branches, etc.). Very
large numbers of refs can impact performance on pushes and pulls due to
the need to enumerate them all.

> * Binary objects in the repo that cause it to bloat in size [1]
> * Other factors?

If you want good performance, I'd recommend the latest version of Git
both client- and server-side. Newer versions of Git provide pack
bitmaps, which can dramatically speed up clones and fetches, and Git
2.3.0 fixes a performance regression with large numbers of refs in
non-shallow repositories. It is well worth rolling your own packages of
Git if your vendor only provides old versions.
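As a concrete sketch of the server-side part (assuming a bare
repository you administer yourself and Git 2.0 or newer; the path below
is only an example), enabling bitmaps is a one-time config change plus
a full repack:

    # On the server, inside the bare repository (example path):
    cd /srv/git/project.git
    git config repack.writeBitmaps true      # write bitmaps on future full repacks
    git repack -A -d --write-bitmap-index    # repack into one pack and build the bitmap now

Hosting tools may already run equivalent housekeeping for you, so it's
worth checking before scheduling your own repacks.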
> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
> networking-- which is most critical here?

I generally find that having a good disk cache is important with large
repositories. It may be advantageous to make sure the developer
machines have adequate memory. Performance is notably better on
development machines (VMs) with 2 GB or 4 GB of memory instead of 1 GB.

I can't speak to the server side, as I'm not directly involved with its
deployment.

> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
> 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
> branches" which are just one little dangling commit required to change the code
> a little bit between a commit and a tag that was not quite made from it.)

I routinely work on a repo that's 1.9 GB packed, with 25k (and rapidly
growing) refs. Other developers work on a repo that's 9 GB packed, with
somewhat fewer refs. We don't tend to have problems with this.

Obviously, performance is better on some of our smaller repos, but it's
not unacceptable on the larger ones. I generally find that the 940 KB
repo with huge numbers of files performs worse than the 1.9 GB repo
with somewhat fewer files.

If you can split your repository into multiple logical repositories,
that will certainly improve performance.

If you end up having pain points, we're certainly interested in working
through those. I've brought up performance problems before, and people
are generally responsive.
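If you want a rough sense of where your own repository falls on the
axes discussed above (packed size, total refs, number of tracked
files), a few stock Git commands will tell you:

    # Run inside a clone of the repository in question:
    git count-objects -v -H    # packed size and object counts
    git for-each-ref | wc -l   # total refs: branches, tags, remote branches, etc.
    git ls-files | wc -l       # tracked files, which is what drives the stat() cost

That gives you concrete numbers to compare against the figures above.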
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187