On Thu, Feb 19, 2015 at 10:26 PM, Stephen Morton <stephen.c.morton@xxxxxxxxx> wrote:
> I posted this to comp.version-control.git.user and didn't get any response.
> I think the question is plumbing-related enough that I can ask it here.
>
> I'm evaluating the feasibility of moving my team from SVN to git. We have a
> very large repo. [1] We will have a central repo using GitLab (or similar)
> that everybody works with. Forks, code sharing, pull requests etc. will be
> done through this central server.
>
> By 'performance', I guess I mean speed of day-to-day operations for devs.
>
>  * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>  * Will a few simultaneous clones from the central server also slow down
>    other concurrent operations for other users?
>  * Will 'git pull' be slow?
>  * 'git push'?
>  * 'git commit'? (It is listed as slow in reference [3].)
>  * 'git status'? (Slow again in reference [3], though I don't see it.)
>  * Some operations might not seem to be day-to-day, but if they are called
>    frequently by the web front-end to GitLab/Stash/GitHub etc. then they can
>    become bottlenecks. (e.g. 'git branch --contains' seems terribly adversely
>    affected by large numbers of branches.)
>  * Others?
>
> Assuming I can put lots of resources into a central server with lots of CPU,
> RAM, fast SSD and fast networking, what aspects of the repo are most likely
> to affect devs' experience?
>  * Number of commits
>  * Sheer disk space occupied by the repo
>  * Number of tags
>  * Number of branches
>  * Binary objects in the repo that cause it to bloat in size [1]
>  * Other factors?
>
> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
> networking-- which is most critical here?
>
> (Stash recommends 1.5 x repo_size x number_of_concurrent_clones of available
> RAM. I assume that is good advice in general.)
>
> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, a 15 GB
> repo, 50k tags, 1,000 branches. (Due to historical code fixups, another
> 5,000 "fix-up branches", which are just one little dangling commit required
> to change the code a little bit between a commit and a tag that was not
> quite made from it.)
>
> While there's lots of information online, much of it is old [3], and with
> git constantly evolving I don't know how valid it still is. Then there's
> anecdotal evidence that is of questionable value. [2]
> Are many/all of the issues Facebook identified [3] resolved? (Yes, I
> understand Facebook went with Mercurial, but I imagine the git team
> nevertheless took their analysis to heart.)

Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:

 * Around 500k commits
 * Around 100k tags
 * Around 5k branches
 * Around 500 commits/day, almost entirely to the same branch
 * A 1.5 GB .git checkout
 * Mostly text source, but some binaries (we're trying to cut down on those [1])

The main scaling issues we have with Git are:

 * "git pull" takes around 10 seconds or so
 * Operations like "git status" are much slower, because they scale with the
   size of the work tree
 * Similarly, "git rebase" takes much longer for each applied commit, I think
   because it does the equivalent of "git status" for every commit it
   applies. Each applied commit takes around 1-2 seconds.
 * We have a lot of contention on pushes, because we're mostly pushing to one
   branch
 * History spelunking (e.g. "git log --reverse -p -G<str>") is taking longer
   by the day

The obvious reason why "git pull" is slow is that git-upload-pack spews the
complete set of refs at you each time. The output of that command is around
10 MB for us now. It takes around 300 ms to run locally from a hot cache, and
a bit more to send it over the network. But actually most of "git fetch" is
spent in the reachability check subsequently done by "git rev-list", which
takes several seconds. I haven't looked into it, but there has to be room for
optimization there: surely it only needs to do reachability checks for new
refs, or it could skip them completely in some "I trust this remote not to
send me corrupt data" mode (which would make sense within a company where you
can trust your main Git box).
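For what it's worth, you can get a rough picture of this for your own
repository with nothing more than the stock commands. Something like the
following ("git ls-remote ." prints more or less the ref list upload-pack
advertises, but it's only an approximation of what actually goes over the
wire):

    # How many refs are we advertising, and roughly how big is the
    # advertisement sent on every fetch?
    git for-each-ref | wc -l
    git ls-remote . | wc -c

    # General repository statistics for comparison:
    git rev-list --count --all      # commits reachable from any ref
    git count-objects -v            # object/pack counts; size-pack is in KiB

    # And where the time actually goes on a (mostly) no-op update:
    time git fetch origin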
The "git status" operations could be made faster by having something like
watchman; there's been some effort on getting that done in Git, but I haven't
tried it. This seems to have been the main focus of Facebook's Mercurial
optimization effort. Some of this you can "solve" by doing e.g.
"git status -uno"; having support for such unsafe operations (e.g. teaching
rebase and pals to use it) would be nice at the cost of some safety, but
having something that feeds off inotify would be even better.

It takes around 3 minutes to re-clone our repo. We really don't care (we
rarely re-clone), but I thought I'd mention it because for some reason this
is important to Facebook, and along with inotify it was one of the two major
things they focused on.

As far as I know everyday Git operations don't scale all that badly with a
huge history. They will a bit, since everything lives in the same pack file,
and this becomes especially noticeable when your packfiles are being evicted
out of the page cache. However, operations like "git repack" seem to be quite
bad at handling this sort of repo. It already takes us many GB of RAM to
repack ours; I'd hate to do the same if it were 10x as big.

Overall I'd say Git would work for you for a repo like that; I'd certainly
still take it over SVN any day. The main thing you might want to try is
partitioning out any binary assets you may have. Usually that's much easier
than splitting up the source tree itself.

I haven't yet done this, but I was planning on writing something to start
archiving our tags (mostly created by [2]) along with aggressively deleting
branches in the repo. I haven't benchmarked it, but I think that'll make the
"pull" operations much faster, which in turn will make the push contention
(lots of lemmings pushing to the same ref) better, since the pull && push
window is reduced.
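Just to sketch what I mean by "archiving" (I haven't written this yet, so
treat it as a rough outline; the archive path and the tag pattern below are
made up for illustration):

    # One-off: create a bare "archive" repository and push every tag to it
    # (the path is just an example)
    git init --bare /srv/git/tags-archive.git
    git push /srv/git/tags-archive.git 'refs/tags/*:refs/tags/*'

    # Then prune old tags from the repository everyone actually clones,
    # first on the server and then locally ('sprint-2013-*' is a made-up
    # pattern):
    git tag -l 'sprint-2013-*' | sed 's|^|refs/tags/|' |
        xargs -r git push --delete origin
    git tag -l 'sprint-2013-*' | xargs -r git tag -d

Anyone who later needs one of the archived tags can still fetch it from the
archive repository, and with tens of thousands of refs gone the advertisement
sent on every fetch shrinks accordingly.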
I do ask myself what we're going to do if we just keep growing and all the
numbers I cited get multiplied by 10-50x. With the current limitations of the
git implementation I think we'd need to split the repo. The main reason we
don't do so is that we like being able to atomically change a library and its
users.

However, there's nothing in the basic Git repository format that inherently
prevents Git from being smarter about large repos; it just seems to be hard
to implement with the way the current client is structured. In particular,
nothing would stop a Git client from:

 * Partially cloning a history but still being able to push upstream. You
   could just get a partial commit/tree/blob graph and fetch the rest
   on-demand as needed.

 * Scaling up to multi-TB or PB repos. We'd just have to treat blobs as
   something fetched on-demand, sort of like what git-annex does, but
   built-in. We'd also have to be less stupid about how we pack big blobs
   (or not pack them at all).

 * Partially cloning a Git working tree. You could ask the server for the
   last N commit objects and what you need for some subdirectory of the repo.
   Then, when you commit, you ask the server what other top-level tree
   objects you need to make a commit.

 * Not touching the filesystem at all: nothing in the Git format itself
   actually requires filesystem access. Not having to deal with external
   things modifying the tree would be another approach to what the inotify
   effort is trying to solve.

Of course changes like that would require a major overhaul of the current
codebase, or another implementation. Some of them require much more active
client/server interaction than what we have now, but they are possible, which
gives me some hope for the future.

Finally, I'd like to mention that if someone here on-list is interested in
doing work on these scalability topics in Git, we'd be open to funding that
effort on some contract basis. Obviously the details would have to be worked
out, blah blah blah, but that came up the last time we had discussions about
this internally. Myself and a bunch of other people at work /could/ work on
this ourselves, but we're busy with other stuff and would much prefer just to
pay someone to fix these issues.

1. https://github.com/avar/pre-receive-reject-binaries
2. https://github.com/git-deploy/git-deploy