Re: Reducing CPU load on git server

Jeff King <peff@xxxxxxxx> · Mon, 29 Aug 2016 13:18:49 -0400

On Mon, Aug 29, 2016 at 12:46:27PM +0200, Jakub Narębski wrote:

> > So your load is probably really spiky, as you get thundering herds of
> > fetchers after every push (the spikes may have a long flatline at the
> > top, as it takes time to process the whole herd).
> 
> One solution I have heard about, in the context of web cache, to reduce
> the thundering herd problem (there caused by cache expiring at the same
> time in many clients) was to add some random or quasi-random distribution
> to expiration time.  In your situation adding a random delay with some
> specified deviation could help.

That smooths the spikes, but you still have to serve all of the requests
eventually. So if your problem is that the load spikes and the system
slows to a crawl as a result (or runs out of RAM, etc), then
distributing the load helps. But if you have enough load that your
system is constantly busy, queueing the load in a different order just
shifts it around.
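
To put rough numbers on that, here is a toy sketch in Python (the fetcher count and the 30-second window are made up for illustration, not measured from any real server). It counts how many fetches start in each one-second bucket after a push, with and without a random delay:

```python
import random
from collections import Counter

def arrivals_per_second(n_fetchers, spread_seconds, rng):
    """Count how many fetches start in each one-second bucket after a push.

    spread_seconds == 0 models the thundering herd (everyone starts at once);
    a positive value gives each fetcher a uniform random delay in that window.
    """
    buckets = Counter()
    for _ in range(n_fetchers):
        delay = rng.uniform(0, spread_seconds) if spread_seconds else 0
        buckets[int(delay)] += 1
    return buckets

herd = arrivals_per_second(100, 0, random.Random(0))
jittered = arrivals_per_second(100, 30, random.Random(0))

# The peak drops, but the total number of requests to serve is unchanged.
print(max(herd.values()), sum(herd.values()))
print(max(jittered.values()), sum(jittered.values()))
```

The peak per-second count falls, while the sum stays at 100 in both cases, which is the "shifts it around" part.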

GHE will also introduce delays into starting git when load spikes, but
that's a separate system from the one coalescing identical requests.

> I wonder if this system for coalescing multiple fetches is something
> generic, or is it something specific to GitHub / GitHub Enterprise
> architecture?  If it is the former, would it be considered for
> upstreaming, and if so, when would it be in Git itself?

I've already sent upstream the patch for a "hook" that sits between
upload-pack and pack-objects (and it will be in v2.10). So that can call
an arbitrary script which can then make scheduling policy for
pack-objects, coalesce similar requests, etc.
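
As a rough idea of what such a script could look like, here is a minimal sketch in Python. The config key is uploadpack.packObjectsHook; upload-pack appends the pack-objects command line it would have run to the hook command, and the hook's stdin and stdout stand in for those of pack-objects. Everything else below (the slot directory, the slot count, the pid-based slot choice) is an invented, deliberately crude policy, not what GHE does:

```python
import fcntl
import os
import subprocess
import sys

def run_with_slot(argv, slot_dir, max_slots):
    """Run argv (the pack-objects command) while holding one of max_slots locks.

    At most max_slots commands run at once; the rest block in flock().
    Picking the slot from the pid is a crude assumption made for brevity.
    """
    os.makedirs(slot_dir, exist_ok=True)
    slot = os.path.join(slot_dir, "slot%d" % (os.getpid() % max_slots))
    with open(slot, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until this slot is free
        # The child inherits our stdin/stdout, so upload-pack talks to
        # pack-objects as if the hook were not there.
        return subprocess.call(argv)

# As an installed hook, the script would end with something like:
#   sys.exit(run_with_slot(sys.argv[1:], "/var/tmp/pack-slots", 4))
# and be enabled with (hypothetical path):
#   git config --global uploadpack.packObjectsHook /usr/local/bin/pack-hook
```

Note that Git only honors this setting from system or user-level config, not from the repository's own config file.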

GHE has a generic tool for coalescing program invocations that is not
Git-specific at all (it compares its stdin and command line arguments to
decide when two requests are identical, runs the command on its
arguments, and then passes the output to all members of the herd). That
_might_ be open-sourced in the future, but I don't have a specific
timeline.

> One thing to note: if you have repositories which are to have the
> same contents, you can distribute the pack-file to them and update
> references without going through Git.  It can be done on push
> (push to master, distribute to mirrors), or as part of fetch
> (master fetches from central repository, distributes to mirrors).
> I think; I have never managed a large set of replicated Git repositories.

Doing it naively has some gotchas, because you want to make sure you
have all of the necessary objects. But if you are going this route,
probably distributing a git-bundle is the simplest way.
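
A minimal sketch of that route, in Python, assuming bare mirrors and a bundle path that every mirror can read (the function names and the refspec are my choices for illustration). The "git bundle verify" step is what guards against the missing-objects gotcha: it fails if the mirror lacks the commits the bundle depends on.

```python
import subprocess

def bundle_commands(bundle_path, old_tip=None):
    """Return the (create, verify, fetch) command lines for one bundle.

    With old_tip set, the bundle is incremental: it leaves out everything
    reachable from old_tip and records it as a prerequisite instead.
    """
    create = ["git", "bundle", "create", bundle_path, "--all"]
    if old_tip is not None:
        create.append("^" + old_tip)
    verify = ["git", "bundle", "verify", bundle_path]
    fetch = ["git", "fetch", bundle_path, "+refs/heads/*:refs/heads/*"]
    return create, verify, fetch

def distribute(master_repo, mirror_repos, bundle_path, old_tip=None):
    """Create the bundle on the master, then verify and fetch on each mirror.

    bundle_path should be absolute, since each command runs in a different
    repository directory.
    """
    create, verify, fetch = bundle_commands(bundle_path, old_tip)
    subprocess.run(create, cwd=master_repo, check=True)
    for mirror in mirror_repos:
        subprocess.run(verify, cwd=mirror, check=True)   # prerequisites present?
        subprocess.run(fetch, cwd=mirror, check=True)
```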

> > Generally no, they should not conflict. Writes into the object database
> > can happen simultaneously. Ref updates take a per-ref lock, so you
> > should generally be able to write two unrelated refs at once. The big
> > exception is that ref deletion requires taking a repo-wide lock, but
> > that presumably wouldn't be a problem for your case.
> 
> Doesn't Git avoid taking locks, and use lockless synchronization
> mechanisms (though possibly equivalent to locks)?  I think it takes
> a lockfile to update reflog together with reference, but if reflogs
> are turned off (and I think they are off for bare repositories by
> default), ref update uses "atomic file write" (write + rename)
> and compare-and-swap primitive.  Updating repository is lock-free:
> first update repository object database, then reference.

There is a lockfile to make the compare-and-swap atomic, but yes, it's
fundamentally based around the compare-and-swap. I don't think that
matters to the end user though. Fundamentally they will see "I hoped to
move from X to Y, but somebody else wrote Z, aborting", which is the
same as "I did not win the lock race, aborting".

The point is that updating two different refs is generally independent,
and updating the same ref is not.
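
To make that concrete, here is a simplified model in Python of a loose-ref update: take "<ref>.lock" with an exclusive create, compare the current value, write the new one into the lockfile, and rename it into place. It is a sketch of the idea only; it ignores packed-refs, reflogs and fsync, and the names are not Git's:

```python
import os

class RefUpdateError(Exception):
    pass

def read_ref(ref_path):
    """Return the ref's current value, or None if it does not exist."""
    try:
        with open(ref_path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None

def update_ref(ref_path, expected_old, new_value):
    """Move a ref from expected_old to new_value, or raise RefUpdateError."""
    lock_path = ref_path + ".lock"
    try:
        # O_EXCL makes creation atomic: only one writer per ref gets the lock.
        fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        raise RefUpdateError("did not win the lock race") from None
    try:
        with os.fdopen(fd, "w") as lock:
            current = read_ref(ref_path)
            if current != expected_old:
                raise RefUpdateError(
                    "hoped to move from %s, but found %s" % (expected_old, current))
            lock.write(new_value + "\n")
        os.replace(lock_path, ref_path)   # atomic rename publishes the new value
    except BaseException:
        if os.path.exists(lock_path):
            os.unlink(lock_path)
        raise
```

Because the lockfile is named after the ref, two writers on different refs never touch the same file, while two writers on the same ref collide on either the lock or the comparison.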

> I guess that trying to replicate DGit approach that GitHub uses, see
> "Introducing DGit" (http://githubengineering.com/introducing-dgit)
> is currently out of the question?

Minor nitpick (that you don't even have any way of knowing about, so
maybe more of a public service announcement). GitHub will stop using the
"DGit" name because it's too confusingly similar to "Git" (and "Git" is
trademarked by the project). There's a new blog post coming that
mentions the name change, and that historic one will have a note added
retroactively. The new name is "GitHub Spokes" (get it, Hub, Spokes?).

But in response to your question, I'll caution that replicating it is a
lot of work. :)

Since the original problem report mentions GHE, I'll note that newer
versions of GHE do support clustering and can share the git load across
multiple Spokes servers. So in theory that could make the replica layer
go away entirely, because it all happens behind the scenes.

-Peff

PS Sorry, I generally try to avoid hawking GitHub wares on the list, but
   since the OP mentioned GHE specifically, and because there aren't
   really generic solutions to most of these things, I do think it's a
   viable path for a solution for him.