> So your load is probably really spiky, as you get thundering herds of
> fetchers after every push (the spikes may have a long flatline at the
> top, as it takes time to process the whole herd).

It is quite spiky, yes. At the moment, however, the replication fleet is
relatively small (just 4 machines). We had 6 machines earlier this month
and had hoped that terminating two of them would lead to a material drop
in CPU usage, but we didn't see a significant reduction.

> Yes, though I'd be surprised if this negotiation is that expensive in
> practice. In my experience it's not generally, and even if we ended up
> traversing every commit in the repository, that's on the order of a few
> seconds even for large, active repositories.
>
> In my experience, the problem in a mass-fetch like this ends up being
> pack-objects preparing the packfile. It has to do a similar traversal,
> but _also_ look at all of the trees and blobs reachable from there, and
> then search for likely delta-compression candidates.
>
> Do you know which processes are generating the load? git-upload-pack
> does the negotiation, and then pack-objects does the actual packing.

When I look at expensive operations (ones that I can see consuming 90%+
of a CPU for more than a second), there are often pack-objects processes
running that consume an entire core for multiple seconds (I also saw one
pack-objects process spend several minutes in its "Counting objects"
phase while using a full core). `rev-list` shows up as a pretty active
CPU consumer, as do `prune` and `blame-tree`. Overall, in terms of
high-CPU activities, `prune` and `rev-list` show up the most frequently.

On the subject of prune - I forgot to mention that the `git fetch` calls
from the subscribers are running `git fetch --prune`. I'm not sure if
that changes the projected load profile.

> Maybe. If pack-objects is where your load is coming from, then
> counter-intuitively things sometimes get _worse_ as you fetch less. The
> problem is that git will generally re-use deltas it has on disk when
> sending to the clients. But if the clients are missing some of the
> objects (because they don't fetch all of the branches), then we cannot
> use those deltas and may need to recompute new ones. So you might see
> some parts of the fetch get cheaper (negotiation, pack-object's
> "Counting objects" phase), but "Compressing objects" gets more
> expensive.

I might be misunderstanding this, but if the subscriber is already "up
to date" modulo a single updated ref tip, then this problem shouldn't
occur, right? Concretely: if ref A is built off of ref B, and the
subscriber already has B when it requests A, that shouldn't cause this
behavior; it would only cause it if the subscriber didn't have B when it
requested A.

> This is particularly noticeable with shallow fetches, which in my
> experience are much more expensive to serve.

I don't think we're doing shallow fetches anywhere in this system.

> Jakub mentioned bitmaps, and if you are using GitHub Enterprise, they
> are enabled. But they won't really help here. They are essentially
> cached information that git generates at repack time. But if we _just_
> got a push, then the new objects to fetch won't be part of the cache,
> and we'll fall back to traversing them as normal. On the other hand,
> this should be a relatively small bit of history to traverse, so I'd
> doubt that "Counting objects" is that expensive in your case (but you
> should be able to get a rough sense by watching the progress meter
> during a fetch).

See my comment above about the long-running "Counting objects" process.
I couldn't tell which of our repositories it was counting, but it ran
for about 3 minutes with full core utilization. I didn't see pack-objects
counting frequently, but when it does, it's an expensive operation.
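
To get a better sense of where that time goes, the spot-check I have in
mind is to run a manual fetch from one replica right after a push and
watch which phase the time goes into. This is only a sketch: I'm
assuming the replicas fetch the primary as `origin`, and the log path is
a placeholder.

    # Run from inside one replica's repository right after a push lands.
    # The server-side phases show up as "remote: Counting objects" and
    # "remote: Compressing objects"; "Receiving objects" and "Resolving
    # deltas" happen on the client. GIT_TRACE_PERFORMANCE prints the
    # overall timing of the fetch to stderr.
    GIT_TRACE_PERFORMANCE=1 \
        git fetch --prune --progress origin 2>&1 | tee /tmp/fetch-trace.log
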
> I'd suspect more that delta compression is expensive (we know we just
> got some new objects, but we don't know if we can make good deltas
> against the objects the client already has). That's a gut feeling,
> though.
>
> If the fetch is small, that _also_ shouldn't be too expensive. But
> things add up when you have a large number of machines all making the
> same request at once. So it's entirely possible that the machine just
> gets hit with a lot of 5s CPU tasks all at once. If you only have a
> couple cores, that takes many multiples of 5s to clear out.

I think this would show up if I were sitting around running `top` on the
machine, but that isn't what I see. That might just be a function of
there being a relatively small number of replication machines, I'm not
sure. But I'm not noticing 4 of the same tasks getting spawned
simultaneously, which says to me that we're either utilizing a cache or
there's some locking behavior involved.

> There's nothing in upstream git to help smooth these loads, but since
> you mentioned GitHub Enterprise, I happen to know that it does have a
> system for coalescing multiple fetches into a single pack-objects. I
> _think_ it's in GHE 2.5, so you might check which version you're
> running (and possibly also talk to GitHub Support, who might have more
> advice; there are also tools for finding out which git processes are
> generating the most load, etc).

We're on 2.6.4 at the moment.

> I suspect there's room for improvement and tuning of the primary. But
> barring that, one option would be to have a hierarchy of replicas. Have
> "k" first-tier replicas fetch from the primary, then "k" second-tier
> replicas fetch from them, and so on. Trade propagation delay for
> distributing the load. :)

Yep, this was the first thought we had. I started fleshing out the
architecture for it (roughly the sketch in the P.S. below), but I wasn't
sure whether the majority of the CPU load was being triggered by the
first request to make it in, in which case moving to a multi-tiered
architecture wouldn't help us. When we didn't notice a material
reduction in CPU load after reducing the replication fleet by a third,
I started to worry about this and stopped moving forward on the
multi-tiered architecture until I understood the behavior better.

- V
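
P.S. The tiered fetch I'd started sketching looks roughly like the
following. It's only a sketch, not something we're running: the
REPLICA_TIER variable, the hostnames, and the repository path are all
made up.

    #!/bin/bash
    # Run from inside the replica's repository. Tier-1 replicas fetch
    # from the primary; everyone else fetches from an assigned tier-1
    # replica, trading a bit of propagation delay for load on the primary.
    if [ "${REPLICA_TIER:-2}" -eq 1 ]; then
        upstream="git@primary.example.com:some-repo.git"
    else
        upstream="git@tier1-replica.example.com:some-repo.git"
    fi
    # Mirror branch refs from the tier above (this refspec assumes the
    # replicas are bare mirrors of the primary's branches).
    git fetch --prune "$upstream" '+refs/heads/*:refs/heads/*'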