Re: Git Scaling: What factors most affect Git performance for a large repo?

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Fri, 20 Feb 2015 15:25:59 +0100

On Fri, Feb 20, 2015 at 1:09 PM, Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
> On Fri, Feb 20, 2015 at 1:04 AM, Duy Nguyen <pclouds@xxxxxxxxx> wrote:
>> On Fri, Feb 20, 2015 at 6:29 AM, Ævar Arnfjörð Bjarmason
>> <avarab@xxxxxxxxx> wrote:
>>> Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:
>>>
>>>  * Around 500k commits
>>>  * Around 100k tags
>>>  * Around 5k branches
>>>  * Around 500 commits/day, almost entirely to the same branch
>>>  * 1.5 GB .git checkout.
>>>  * Mostly text source, but some binaries (we're trying to cut down[1] on those)
>>
>> Would be nice if you could make an anonymized version of this repo
>> public. Working on a "real" large repo is better than an artificial
>> one.
>
> Yeah, I'll try to do that.

tl;dr: After some more testing it turns out the performance issues we
have are almost entirely due to the number of refs. Some of these I
knew about and were obvious (e.g. git pull), but some aren't so
obvious (why does "git log" without "--all" slow down as a function of
the overall number of refs?).

Rather than getting an anonymized version of the repo we have, a
simpler isolated test case is just doing this on linux.git:

    $ git rev-list --all | perl -ne 'my $cnt; while (<>) {
        s<([a-f0-9]+)><git tag -a -m"Test" TAG $1>gm;
        next unless int rand 10 == 1;
        $cnt++; s/TAG/tagnr-$cnt/; print }' | sh -x

That'll create a tag for every 10th commit or so, which is around 50k
tags for linux.git.
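
If annotated tags aren't essential to the test, a quicker way to create
that many refs might be to batch them through "git update-ref --stdin".
This is only a sketch: gen_tag_commands is a made-up helper name, it
picks every Nth commit deterministically instead of randomly, and it
creates lightweight tags rather than the annotated ones above, so the
timings may not match.

```shell
# Hypothetical helper: read commit ids on stdin and emit one "create"
# command per <step>-th commit, in the format `git update-ref --stdin`
# accepts. Lightweight tags only (assumption: annotation is not needed).
gen_tag_commands() {
    step=${1:-10}
    awk -v step="$step" 'NR % step == 0 {
        n++
        printf "create refs/tags/tagnr-%d %s\n", n, $1
    }'
}

# Usage (inside a clone of linux.git):
#   git rev-list --all | gen_tag_commands 10 | git update-ref --stdin
```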

I actually ran this a few times while testing it, so this is a before
and after on a hot cache of linux.git with 406 tags vs. ~140k. For both
repos I ran the gc + repack + bitmaps steps noted in an earlier reply
of mine, and took the fastest run out of 3:

    $ time (git log master -100 >/dev/null)
    Before: real    0m0.021s
    After: real    0m2.929s
    $ time (git status >/dev/null)
    # Around 150ms, no noticeable difference
    $ time git fetch
    # I'm fetching from git@xxxxxxxxxx:torvalds/linux.git here, the
    # cache is hot but upstream has *no* changes
    Before: real    0m1.826s
    After: real    0m8.458s

Details on why "git fetch" is slow in this situation:

    $ time GIT_TRACE=1 git fetch
    15:15:00.435420 git.c:349               trace: built-in: git 'fetch'
    15:15:00.654428 run-command.c:341       trace: run_command: 'ssh' 'git@xxxxxxxxxx' 'git-upload-pack '\''torvalds/linux.git'\'''
    15:15:02.426121 run-command.c:341       trace: run_command: 'rev-list' '--objects' '--stdin' '--not' '--all' '--quiet'
    15:15:05.507327 run-command.c:341       trace: run_command: 'rev-list' '--objects' '--stdin' '--not' '--all'
    15:15:05.508329 exec_cmd.c:134          trace: exec: 'git' 'rev-list' '--objects' '--stdin' '--not' '--all'
    15:15:05.510490 git.c:349               trace: built-in: git 'rev-list' '--objects' '--stdin' '--not' '--all'
    15:15:08.874116 run-command.c:341       trace: run_command: 'gc' '--auto'
    15:15:08.879570 exec_cmd.c:134          trace: exec: 'git' 'gc' '--auto'
    15:15:08.882495 git.c:349               trace: built-in: git 'gc' '--auto'
    real    0m8.458s
    user    0m6.548s
    sys     0m0.204s
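
To see where the time goes, the gaps between consecutive trace
timestamps can be computed with a bit of awk. A rough sketch:
trace_deltas is a made-up name, it assumes the HH:MM:SS.usec prefix
shown above, it skips any line that doesn't start with a timestamp, and
it doesn't handle a run that crosses midnight.

```shell
# Hypothetical helper: read GIT_TRACE output on stdin and print, for
# each trace entry, the seconds elapsed until the next entry.
trace_deltas() {
    awk '
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]+$/ {
        split($1, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        # Charge the gap to the entry that was running before this one.
        if (seen) printf "%.3f %s\n", now - prev, desc
        desc = $0
        sub(/^[ \t]*[^ \t]+[ \t]+/, "", desc)   # drop the timestamp
        prev = now
        seen = 1
    }'
}

# Usage (GIT_TRACE writes to stderr):
#   GIT_TRACE=1 git fetch 2>&1 | trace_deltas
```

Going by the timestamps in the trace above, about 3.1s and 3.4s pass
after each of the two "rev-list ... --not --all" invocations starts,
which is most of the 8.4s total.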

Even things you'd expect to not be impacted are, like a reverse log
search on the master branch:

    $ time (git log --reverse -p --grep=arm64 origin/master >/dev/null)
    Before: real    0m4.473s
    After: real    0m6.194s

Or doing 10 commits and rebasing on the upstream:

    $ time (git checkout origin/master~ &&
        for i in {1..10}; do
            echo $i > file && git add file && git commit -m"moo" file
        done &&
        git rebase origin/master)
    Before: real    0m6.798s
    After: real    0m12.340s

The remaining slowdown comes from the size of the tree, which we can
deal with by either reducing it in size (we have some copied JS
libraries and whatnot) or trying the inotify-powered git-status.

In our case there's no good reason why we have this many refs in
the repository everyone uses. We basically just have a bunch of dated
rollout tags that have been accumulating for years, and a bunch of
mostly unused branches people just haven't cleaned up.

So I'm going to:

 1. Write a hook that rejects tags that aren't new (i.e. forbid
re-pushes of old tags)
 2. Create an archive repository that contains all the old tags (i.e.
just run "git fetch" on the main one from cron)
 3. Run a script to regularly delete tags from the main repo
 4. Run the same script on the clients that clone the repo

The branches are slightly harder. Deleting those that are fully merged
into the main branch is easy. Deleting those whose commits all match
patch-ids already in the main branch is another thing we can do. Beyond
that we can just clean up branches unconditionally after they've
reached a certain age (they'll still be archived).
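
The three branch criteria map fairly directly onto existing commands:
"git branch -r --merged" for fully merged branches, "git cherry" for
patch-id equivalence, and "git for-each-ref" with a date field for age.
A sketch of the last two as stdin filters, with made-up function names
and origin/master assumed to be the main branch:

```shell
# Hypothetical helper: succeeds if `git cherry` output on stdin has no
# "+" lines, i.e. every commit has a patch-id equivalent upstream.
all_patches_upstream() {
    ! grep -q '^+'
}

# Hypothetical helper: read "<epoch-seconds> <refname>" lines on stdin
# and print the refnames older than <days> relative to <now>.
# Args: now (epoch seconds), days
refs_older_than() {
    awk -v now="$1" -v days="$2" '$1 < now - days * 86400 { print $2 }'
}

# Usage (assumes a git recent enough for the ":unix" date format):
#   git branch -r --merged origin/master
#   git cherry origin/master origin/some-branch | all_patches_upstream &&
#       echo "some-branch is already applied upstream"
#   git for-each-ref --format='%(committerdate:unix) %(refname)' \
#       refs/remotes/origin | refs_older_than "$(date +%s)" 180
```

The same age filter could also serve the tag-pruning script in step 3
if it is pointed at refs/tags instead.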