Reducing CPU load on git server

"W. David Jarvis" <william.d.jarvis@xxxxxxxxx> · Sun, 28 Aug 2016 12:42:52 -0700

Hi all -

I've run into a problem that I'm looking for some help with. Let me
describe the situation, and then some thoughts.

The company I work for uses git. We use GitHub Enterprise as a
frontend for our primary git server. We're using Chef solo to manage a
fleet of upwards of 10,000 hosts, which regularly pull from our Chef
git repository so as to converge.

A few years ago, in order to reduce the load on the primary server,
the firm set up a fleet of replication servers. This allowed the
majority of our infrastructure to target the replication servers for
all pull-based activity rather than the primary git server.

The actual replication process works as follows:

1. The primary git server receives a push and sends a webhook with the
details of the push (repo, ref, sha, some metadata) to a "publisher"
box

2. The publisher enqueues the details of the webhook into a queue

3. A fleet of "subscriber" (replica) boxes each reads the payload of
the enqueued message. Each of these then tries to either clone the
repository if they don't already have it, or they run `git fetch`.

During the course of either of these operations we use a
repository-level lockfile. If another request comes in while the repo
is locked, we re-enqueue it. When that request comes in later, if the
push event time is earlier than the most recent successful fetch, we
don't do any further work.

---

The problem we've been running into has been that as we've continued
to add developers, the CPU load on our primary instance has gone
through the roof. We've upgraded the machine to some of the punchiest
hardware we can get and it's regularly exceeding 70% CPU load. We know
that the overwhelming drive of this CPU load is near-constant git
operations on some of our larger and more active repositories.

Migrating off of Chef solo is essentially a non-starter at this point
in time, and we've also added a considerable number of non-Chef git
dependencies such that getting rid of the replication service would be
a massive undertaking. Even if we were to kill the replication fleet,
we'd have to figure out where to point that traffic in such a way that
we didn't overwhelm the primary git server anyways.

So I'm looking for some help in figuring out what to do next. At the
moment, I have two thoughts:

1. We currently run a blanket `git fetch` rather than specifically
fetching the ref that was pushed. My understanding from poking around
the git source code is that this causes the replication server to send
a list of all of its ref tips to the primary server, and the primary
server then has to verify and compare each of these tips to the ref
tips residing on the server.

That might not be a ton of work to do on an individual fetch
operation, but some of our repositories have over 5,000 branches and
are pushed to 1,000 times a day. Algorithmically, this would suggest
that the cost of a fetch will go up in terms of both N -- branches and
M -- pushes, so we're talking about a cost of N*M, both of which will
increase when we hire developers. This implies exponential growth in
load as we add engineers, which is...not good.

Also worth noting - if we assume that the replication server should be
reasonably up-to-date (see lockfile logic description, above), we're
talking about typically packing objects for one ref, and at most a
small number (<5).

My hypothesis is that moving to fetching the specific branch rather
than doing a blanket fetch would have a significant and material
impact on server load.

If we do go down this route, we'll potentially need to do some
refactoring around how we handle "failed fetches", which relates both
to our locking logic and to the actual potential failure of a git
fetch. Discussion of the locking mechanism follows:

2. Our current locking mechanism seems to me to be un-necessary. I'm
aware that git uses a few different locking mechanisms internally, and
the use of a repo-level lockfile would seem to only guarantee that
we're using coarser-grained locking than git actually supports. But I
don't know if git supports specific fetch-level ref locking that would
permit concurrent ref fetch operations. If it does, our current
architecture would seem to prevent git from being able to take
advantage of those mechanisms.

In other words, let's imagine a world in which we ditch our current
repo-level locking mechanism entirely. Let's also presume we move to
fetching specific refs rather than using blanket fetches. Does that
mean that if a fetch for ref A and a fetch for ref B are issued at
roughly the exact same time, the two will be able to be executed at
once without running into some git-internal locking mechanism on a
granularity coarser than the ref? i.e. are fetch A and fetch B going
to be blocked on the other's completion in any way? (let's presume
that ref A and ref B are not parents of each other).

---

I'm neither a contributor nor an expert in git, so my inferences thus
far as based purely off of what I would describe as "stumbling around
the source code like a drunken baby".

The ultimate goal for us is just figuring out how we can best reduce
the CPU load on the primary instance so that we don't find ourselves
in a situation where we're not able to run basic git operations
anymore.

If I'm barking up the wrong tree, or if there are other optimizations
we should be considering, I'd be eager to learn about those as well.
Of course, if what I'm describing sounds about right, I'd like
confirmation of that from some people who actually know what they're
talking about (i.e., not me :) ).

Thanks.

 - Venantius

-- 
============
venanti.us
203.918.2328
============
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html