On Thu, Feb 19, 2015 at 10:26 PM, Stephen Morton <stephen.c.morton@xxxxxxxxx> wrote:
> I posted this to comp.version-control.git.user and didn't get any response.
> I think the question is plumbing-related enough that I can ask it here.
>
> I'm evaluating the feasibility of moving my team from SVN to git. We have a
> very large repo. [1] We will have a central repo using GitLab (or similar)
> that everybody works with. Forks, code sharing, pull requests etc. will be
> done through this central server.
>
> By 'performance', I guess I mean speed of day-to-day operations for devs.
>
>  * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>  * Will a few simultaneous clones from the central server also slow down
>    other concurrent operations for other users?
>  * Will 'git pull' be slow?
>  * 'git push'?
>  * 'git commit'? (It is listed as slow in reference [3].)
>  * 'git status'? (Slow again in reference [3], though I don't see it.)
>  * Some operations might not seem to be day-to-day, but if they are called
>    frequently by the web front-end to GitLab/Stash/GitHub etc. then they can
>    become bottlenecks. (e.g. 'git branch --contains' seems terribly adversely
>    affected by large numbers of branches.)
>  * Others?
>
> Assuming I can put lots of resources into a central server with lots of CPU,
> RAM, fast SSD and fast networking, what aspects of the repo are most likely
> to affect devs' experience?
>  * Number of commits
>  * Sheer disk space occupied by the repo
>  * Number of tags
>  * Number of branches
>  * Binary objects in the repo that cause it to bloat in size [1]
>  * Other factors?
>
> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
> networking-- which is most critical here?
>
> (Stash recommends 1.5 x repo_size x number_of_concurrent_clones of available
> RAM. I assume that is good advice in general.)
>
> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, a 15 GB
> repo, 50k tags, 1,000 branches. (Due to historical code fixups, another
> 5,000 "fix-up branches", which are just one little dangling commit required
> to change the code a little bit between a commit and a tag that was not
> quite made from it.)
>
> While there's lots of information online, much of it is old [3], and with
> git constantly evolving I don't know how valid it still is. Then there's
> anecdotal evidence that is of questionable value. [2]
> Are many/all of the issues Facebook identified [3] resolved? (Yes, I
> understand Facebook went with Mercurial, but I imagine the git team
> nevertheless took their analysis to heart.)

Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:

 * Around 500k commits
 * Around 100k tags
 * Around 5k branches
 * Around 500 commits/day, almost entirely to the same branch
 * A 1.5 GB .git checkout
 * Mostly text source, but some binaries (we're trying to cut down on those [1])

The main scaling issues we have with Git are:

 * "git pull" takes around 10 seconds or so
 * Operations like "git status" are much slower, because they scale with the
   size of the work tree
 * Similarly, "git rebase" takes much longer for each applied commit, I think
   because it does the equivalent of "git status" for every commit it
   applies. Each applied commit takes around 1-2 seconds.
 * We have a lot of contention on pushes, because we're mostly pushing to one
   branch
 * History spelunking (e.g. "git log --reverse -p -G<str>") is taking longer
   by the day

The obvious reason why "git pull" is slow is that git-upload-pack spews the
complete set of refs at you each time. The output of that command is around
10 MB for us now. It takes around 300 ms to run locally from a hot cache, and
a bit more to send it over the network. But actually most of "git fetch" is
spent in the reachability check subsequently done by "git rev-list", which
takes several seconds. I haven't looked into it, but there has to be room for
optimization there: surely it only needs to do reachability checks for new
refs, or it could skip them completely in some "I trust this remote not to
send me corrupt data" mode (which would make sense within a company where you
can trust your main Git box).
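For what it's worth, you can get a rough picture of this for your own
repository with nothing more than the stock commands. Something like the
following ("git ls-remote ." prints more or less the ref list upload-pack
advertises, but it's only an approximation of what actually goes over the
wire):

    # How many refs are we advertising, and roughly how big is the
    # advertisement sent on every fetch?
    git for-each-ref | wc -l
    git ls-remote . | wc -c

    # General repository statistics for comparison:
    git rev-list --count --all      # commits reachable from any ref
    git count-objects -v            # object/pack counts; size-pack is in KiB

    # And where the time actually goes on a (mostly) no-op update:
    time git fetch origin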
The "git status" operations could be made faster by having something like
watchman; there's been some effort on getting that done in Git, but I haven't
tried it. This seems to have been the main focus of Facebook's Mercurial
optimization effort. Some of this you can "solve" by doing e.g.
"git status -uno"; having support for such unsafe operations (e.g. teaching
rebase and pals to use it) would be nice at the cost of some safety, but
having something that feeds off inotify would be even better.

It takes around 3 minutes to re-clone our repo. We really don't care (we
rarely re-clone), but I thought I'd mention it because for some reason this
is important to Facebook, and along with inotify it was one of the two major
things they focused on.

As far as I know everyday Git operations don't scale all that badly with a
huge history. They will a bit, since everything lives in the same pack file,
and this becomes especially noticeable when your packfiles are being evicted
out of the page cache. However, operations like "git repack" seem to be quite
bad at handling this sort of repo. It already takes us many GB of RAM to
repack ours; I'd hate to do the same if it were 10x as big.

Overall I'd say Git would work for you for a repo like that; I'd certainly
still take it over SVN any day. The main thing you might want to try is
partitioning out any binary assets you may have. Usually that's much easier
than splitting up the source tree itself.

I haven't yet done this, but I was planning on writing something to start
archiving our tags (mostly created by [2]) along with aggressively deleting
branches in the repo. I haven't benchmarked it, but I think that'll make the
"pull" operations much faster, which in turn will make the push contention
(lots of lemmings pushing to the same ref) better, since the pull && push
window is reduced.
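Just to sketch what I mean by "archiving" (I haven't written this yet, so
treat it as a rough outline; the archive path and the tag pattern below are
made up for illustration):

    # One-off: create a bare "archive" repository and push every tag to it
    # (the path is just an example)
    git init --bare /srv/git/tags-archive.git
    git push /srv/git/tags-archive.git 'refs/tags/*:refs/tags/*'

    # Then prune old tags from the repository everyone actually clones,
    # first on the server and then locally ('sprint-2013-*' is a made-up
    # pattern):
    git tag -l 'sprint-2013-*' | sed 's|^|refs/tags/|' |
        xargs -r git push --delete origin
    git tag -l 'sprint-2013-*' | xargs -r git tag -d

Anyone who later needs one of the archived tags can still fetch it from the
archive repository, and with tens of thousands of refs gone the advertisement
sent on every fetch shrinks accordingly.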
I do ask myself what we're going to do if we just keep growing and all the
numbers I cited get multiplied by 10-50x. With the current limitations of the
git implementation I think we'd need to split the repo. The main reason we
don't do so is that we like being able to atomically change a library and its
users.

However, there's nothing in the basic Git repository format that inherently
prevents Git from being smarter about large repos; it just seems to be hard
to implement with the way the current client is structured. In particular,
nothing would stop a Git client from:

 * Partially cloning a history but still being able to push upstream. You
   could just get a partial commit/tree/blob graph and fetch the rest
   on-demand as needed.

 * Scaling up to multi-TB or PB repos. We'd just have to treat blobs as
   something fetched on-demand, sort of like what git-annex does, but
   built-in. We'd also have to be less stupid about how we pack big blobs
   (or not pack them at all).

 * Partially cloning a Git working tree. You could ask the server for the
   last N commit objects and what you need for some subdirectory of the repo.
   Then, when you commit, you ask the server what other top-level tree
   objects you need to make a commit.

 * Not touching the filesystem at all: nothing in the Git format itself
   actually requires filesystem access. Not having to deal with external
   things modifying the tree would be another approach to what the inotify
   effort is trying to solve.

Of course changes like that would require a major overhaul of the current
codebase, or another implementation. Some of them require much more active
client/server interaction than what we have now, but they are possible, which
gives me some hope for the future.

Finally, I'd like to mention that if someone here on-list is interested in
doing work on these scalability topics in Git, we'd be open to funding that
effort on some contract basis. Obviously the details would have to be worked
out, blah blah blah, but that came up the last time we had discussions about
this internally. Myself and a bunch of other people at work /could/ work on
this ourselves, but we're busy with other stuff and would much prefer just to
pay someone to fix these issues.

1. https://github.com/avar/pre-receive-reject-binaries
2. https://github.com/git-deploy/git-deploy