avoiding false detection of down OSDs

As Ceph gets deployed on larger clusters, our most common scaling
issues have related to
1) our heartbeat system, and
2) handling the larger numbers of OSDMaps that get generated as the
OSD count (failures, boots, etc) and PG count (osd up-thrus, pg_temp
insertions, etc) increase.

Lately we haven't had many issues with heartbeats when the OSDs are
happy, so it looks like the latest respin of the heartbeat code is
going to satisfy us going forward. Fast OSD map generation continues
to be a concern, but with the recent merge of Sam's new map handling
code (which reduced the amount of disk work required to process a map
and shuffled responsibility out of the main OSD thread and into the
more highly-threaded PGs), map processing has become significantly
less expensive. We also have a number of implemented and planned
changes, from the short term to the long term, to continue making it
less painful.

However, we've started seeing a new issue at the intersection of these
two problems: what happens when an OSD slows down because it's
processing too many maps but continues to operate? In large clusters,
an OSD might go down and come back up with hundreds-to-thousands of
maps to process — often at the same time as other OSDs. We've started
to observe issues during software upgrades where a lot of OSDs come up
together and process so many maps that they run out of memory and
start swapping[1]. This can easily cause them to miss heartbeats long
enough to get marked down — but then they finish map processing, tell
the monitors they *are* alive, and get marked back up. This sequence
can cause so many new maps to be generated that it repeats itself on the
new nodes, spreads to other nodes in the cluster, or even causes some
running OSD daemons to get marked out. We've taken to calling this
"OSD thrashing".

It would be great if we could come up with a systemic way to reduce
thrashing, independent of our efforts to reduce the triggering
factors. (For one thing, when only one node is thrashing we probably
want to mark it down to preserve performance, whereas when half the
cluster is thrashing we want to keep them up to reduce cluster-wide
load increases.) A few weeks ago some of us at Inktank had a meeting
to discuss the issue, and I've finally gotten around to writing it up
in this email so that we can ask for input from the wider community!

After discussing several approaches (including scaling heartbeat
intervals as more nodes are marked down or report being wrongly marked
down, putting caps on the number of nodes that can be automatically
marked down and/or out, applying rate limiters to the auto-marks,
etc), we realized that what we really wanted was to do our best to
estimate the chances that an OSD which missed its heartbeat window was
simply laggy rather than actually down.
While in the long term I'm a proponent of pushing most of this
heartbeat handling logic to the OSDs, in the short term adjustments to
the algorithm are much easier to implement in the monitor (which
already has a lot more cluster state locally). We came up with a broad
algorithm to estimate the chance that an OSD is laggy instead of down:
first, figure out the probability that the OSD is down based on its
past history, and then figure out that probability for the cluster
that the OSD belongs to.
Basically:
1) When an OSD boots, keep track of whether it reports itself as a
fresh boot or as wrongly-marked-down. Maintain the probability that
the OSD was actually down versus merely laggy based on that data and
an exponential decay (more recent reports matter more), and maintain
the length of time the OSD was laggy for in those cases.
2) When a sufficient number of failure reports come in to mark an OSD
down, additionally compute the laggy probability and laggy interval
for the reporters in aggregate.
3) Adjust the "heartbeat grace" locally on the monitor according to
the following formula (see the sketch after this list):
    adjusted_heartbeat_grace = heartbeat_grace
        + laggy_interval * (1 / laggy_probability)
        + group_laggy_interval * (1 / group_laggy_probability)
4) If we reach the end of that adjusted heartbeat grace and we have
not received any failure cancellations (which already exist: when an
OSD gets a heartbeat from a node it has reported as down but which
isn't yet marked down, it sends a cancellation), then mark the OSD
down.
5) When running the out check, adjust the "down to out interval" by
the same ratio we've adjusted the heartbeat grace by.
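
To make this concrete, here is a minimal sketch in Python of how a
monitor-side tracker along these lines might look. It is purely
illustrative: the names (LaggyTracker, HEARTBEAT_GRACE, DECAY_HALFLIFE,
DOWN_OUT_INTERVAL), the half-life, the 0.5 blending weight, and the
clamping of probabilities away from zero are all my assumptions, not
actual monitor code; only the grace formula itself is transcribed from
step 3. Step 4 is plain monitor bookkeeping and is omitted.

import math
import time

# Assumed constants; the real defaults would come from monitor configuration.
HEARTBEAT_GRACE = 20.0        # base heartbeat grace, in seconds
DOWN_OUT_INTERVAL = 300.0     # base "down to out" interval, in seconds
DECAY_HALFLIFE = 60 * 60.0    # half-life for decaying old boot reports


class LaggyTracker:
    """Step 1: per-OSD record of how often past failures turned out to be
    lag rather than death, and how long those laggy episodes lasted."""

    def __init__(self):
        self.laggy_probability = 0.0  # decayed estimate that "down" really meant laggy
        self.laggy_interval = 0.0     # decayed estimate of how long the lag lasted (s)
        self.last_update = time.time()

    def _decay(self, now):
        # Exponential decay so that more recent reports matter more.
        factor = math.exp(math.log(0.5) * (now - self.last_update) / DECAY_HALFLIFE)
        self.laggy_probability *= factor
        self.laggy_interval *= factor
        self.last_update = now

    def record_boot(self, wrongly_marked_down, down_for_seconds, now=None):
        # Called when the OSD boots and tells us whether it is a fresh boot or
        # was wrongly marked down (and, if so, for how long it was unreachable).
        now = time.time() if now is None else now
        self._decay(now)
        if wrongly_marked_down:
            # The 0.5 blending weight is an arbitrary illustrative choice.
            self.laggy_probability = 0.5 * self.laggy_probability + 0.5
            self.laggy_interval = 0.5 * self.laggy_interval + 0.5 * down_for_seconds


def adjusted_heartbeat_grace(target, reporters):
    """Steps 2-3: widen the grace using the target OSD's own history plus the
    aggregate history of the OSDs reporting it as failed."""
    n = max(len(reporters), 1)
    group_probability = sum(r.laggy_probability for r in reporters) / n
    group_interval = sum(r.laggy_interval for r in reporters) / n
    # Clamp the probabilities so the 1/p terms stay finite; if an OSD has never
    # been laggy its laggy_interval is ~0, so the extra term vanishes anyway.
    p = max(target.laggy_probability, 1e-3)
    gp = max(group_probability, 1e-3)
    return (HEARTBEAT_GRACE
            + target.laggy_interval * (1.0 / p)
            + group_interval * (1.0 / gp))


def adjusted_down_out_interval(target, reporters):
    """Step 5: scale the down-to-out interval by the same ratio as the grace."""
    ratio = adjusted_heartbeat_grace(target, reporters) / HEARTBEAT_GRACE
    return DOWN_OUT_INTERVAL * ratio

Usage would be along these lines: when osd.3 boots claiming it was
wrongly marked down after being unreachable for 45 seconds, call
trackers[3].record_boot(True, 45); when enough failure reports against
osd.3 accumulate, only mark it down once
adjusted_heartbeat_grace(trackers[3], reporter_trackers) seconds have
passed without a cancellation arriving.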

This algorithm has several nice properties:
1) It allows us to independently account for both the probability that
the node is laggy, and for the length of time the node is usually
laggy for.
2) It localizes lagginess by PG relationships — if your Ceph cluster
has multiple pools stored in different locations, lagginess won't
cross those boundaries.
3) It's not too expensive, and by framing it the way we have (in terms
of estimating probabilities) we can shuffle the generic algorithm
around (eg, eventually move these calculations to the reporting OSDs).
There are a couple of things it doesn't do:
1) It doesn't do a good job of noticing that a particular rack is
laggy compared to other racks within the same pool.
2) It's all continuous — there isn't yet any sense of "don't guess
anybody is laggy until we've seen a certain amount of churn over the
last x minutes".
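
For what it's worth, that kind of gate could probably be layered on top
without touching the core estimate. A purely illustrative sketch, where
the window and threshold are made-up numbers and nothing corresponds to
an existing option:

def enough_recent_churn(wrongly_down_timestamps, now,
                        window=600.0, min_events=3):
    # Only start padding the heartbeat grace once we've seen at least
    # min_events wrongly-marked-down reports in the last window seconds.
    recent = [t for t in wrongly_down_timestamps if now - t <= window]
    return len(recent) >= min_events

If the gate hasn't tripped, the monitor would just use the plain
heartbeat_grace.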

We think that this is a good start and that any necessary
modifications will be pretty easy to add, but if you have other ideas
or critiques we'd love to hear about them!
-Greg

[1]: And we are doing a lot of work to reduce memory consumption, but
while that can delay the problem, it can't fix it.