Hi Greg,

Thanks for the write-up. I have a couple of questions below.

On 07/30/2012 12:46 PM, Gregory Farnum wrote:
> As Ceph gets deployed on larger clusters, our most common scaling issues have related to 1) our heartbeat system, and 2) handling the larger numbers of OSDMaps that get generated by increases in the OSD count (failures, boots, etc) and PG count (osd up-thrus, pg_temp insertions, etc). Lately we haven't had many issues with heartbeats when the OSDs are happy, so it looks like the latest respin of the heartbeat code is going to satisfy us going forward. Fast OSDMap generation continues to be a concern, but with the recent merge of Sam's new map handling code (which reduced the amount of disk effort required to process a map and shuffled responsibility out of the main OSD thread and into the more highly-threaded PGs) it has become significantly less expensive, and we have a number of implemented and planned changes (from the short- to the long-term) to continue making it less painful.
>
> However, we've started seeing a new issue at the intersection of these separate problems: what happens when an OSD slows down because it's processing too many maps, but continues to operate. In large clusters, an OSD might go down and come back up with hundreds to thousands of maps to process -- often at the same time as other OSDs. We've started to observe issues during software upgrades where a lot of OSDs come up together and process so many maps that they run out of memory and start swapping[1]. This can easily cause them to miss heartbeats long enough to get marked down -- but then they finish map processing, tell the monitors they *are* alive, and get marked back up. This sequence can generate so many new maps that it repeats itself on the new nodes, spreads to other nodes in the cluster, or even causes some running OSD daemons to get marked out. We've taken to calling this "OSD thrashing".
>
> It would be great if we could come up with a systemic way to reduce thrashing, independent of our efforts to reduce the triggering factors. (For one thing, when only one node is thrashing we probably want to mark it down to preserve performance, whereas when half the cluster is thrashing we want to keep them up to reduce cluster-wide load increases.)
>
> A few weeks ago some of us at Inktank had a meeting to discuss the issue, and I've finally gotten around to writing it up in this email so that we can ask for input from the wider community! After discussing several approaches (including scaling heartbeat intervals as more nodes are marked down or report being wrongly marked down, putting caps on the number of nodes that can be auto-marked down and/or out, applying rate limiters to the auto-marks, etc), we realized that what we really wanted was to do our best to estimate the chances that an OSD which missed its heartbeat window was simply laggy rather than down.
I don't understand the functional difference between an OSD that is too busy to process its heartbeat in a timely fashion, and one that is down. In either case, it cannot meet its obligations to its peers. I understand that wrongly marking an OSD down adds unnecessary map processing work. Also, if an OSD is wrongly marked down, then any data that would have been written to it while it is marked down will be written to other OSDs, and will need to be migrated when that OSD is marked back up.

I don't fully understand the impact of not marking down an OSD that really is dead, particularly if the cluster is under a heavy write load from many clients. At the very least, write requests that have a replica on such an OSD will stall waiting for an ack that will never come, or for a new map, right?

It seems to me that each of the discarded solutions has similar properties to the favored solution: they address a symptom, rather than the cause. Above you mentioned that you are seeing these issues as you scale out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred solution handles this issue perfectly on the biggest cluster anyone has built today. What do you predict will happen when that cluster size is scaled up by a factor of 2, or 10, or 100?
> While long-term I'm a proponent of pushing most of this heartbeat handling logic to the OSDs, in the short term adjustments to the algorithm are much easier to implement in the monitor (which already has a lot more cluster state locally). We came up with a broad algorithm to estimate the chance that an OSD is laggy instead of down: first, figure out the probability that the OSD is down based on its own past history, and then figure out that probability for the cluster that the OSD belongs to. Basically:
>
> 1) When an OSD boots, keep track of whether it reports itself as fresh or as wrongly-marked-down. Maintain the probability that the OSD is actually down versus laggy based on that data and an exponential decay (more recent reports matter more), and maintain the length of time the OSD was laggy for in those cases.
> 2) When a sufficient number of failure reports come in to mark an OSD down, additionally compute the laggy probability and laggy interval for the reporters in aggregate.
> 3) Adjust the "heartbeat grace" locally on the monitor according to the following formula:
>
>    adjusted_heartbeat_grace = heartbeat_grace
>                               + laggy_interval * (1 / laggy_probability)
>                               + group_laggy_interval * (1 / group_laggy_probability)
>
> 4) If we reach the end of that adjusted heartbeat grace, and we have not received failure cancellations (which already exist; when an OSD gets a heartbeat from a node it has reported down but which isn't marked down, the OSD sends a cancellation), then mark the OSD down.
> 5) When running the out check, adjust the "down to out interval" by the same ratio we've adjusted the heartbeat grace by.
>
> This algorithm has several nice properties:
>
> 1) It allows us to independently account for both the probability that the node is laggy, and for the length of time the node is usually laggy for.
This implies to me you think the root cause of lagginess is independent of client offered load. Otherwise, if client offered load does impact lagginess, then your estimate of the probability that an OSD is laggy is only useful for as long as your offered load doesn't change, no?
> 2) It localizes lagginess by PG relationships -- if your Ceph cluster has multiple pools stored in different locations, lagginess won't cross those boundaries.
> 3) It's not too expensive, and by framing it the way we have (in terms of estimating probabilities) we can shuffle the generic algorithm around (eg, eventually move these calculations to the reporting OSDs).
>
> There are a couple of things it doesn't do:
>
> 1) It doesn't do a good job of noticing that a particular rack is laggy compared to other racks within the same pool.
> 2) It's all continuous -- there isn't yet any sense of "don't guess anybody is laggy until we've seen a certain amount of churn over the last x minutes".
>
> We think that this is a good start and that any necessary modifications will be pretty easy to add, but if you have other ideas or critiques we'd love to hear about them!
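To make the bookkeeping in steps 1-3 above concrete, here is a minimal Python sketch of how a monitor might track per-OSD laggy statistics and compute the adjusted grace. The class and helper names (LaggyStats, record_boot, adjusted_heartbeat_grace) and the exponential-decay weighting are illustrative assumptions, not the actual ceph-mon code; the grace computation just follows the formula quoted in step 3.

```python
import time

# Illustrative sketch only -- names and structure are assumptions, not the
# actual ceph-mon implementation. The grace adjustment follows step 3 above.

class LaggyStats:
    """Tracks, for one OSD (or one group of OSDs), an exponentially decayed
    estimate of how often a 'failure' was really just lagginess, and how
    long the laggy episodes typically lasted."""

    def __init__(self, decay=0.7):
        self.decay = decay              # weight given to history vs. the newest report
        self.laggy_probability = 0.0    # estimated P(marked down but actually laggy)
        self.laggy_interval = 0.0       # typical length of a laggy episode (seconds)

    def record_boot(self, wrongly_marked_down, down_duration):
        """Step 1: called when an OSD boots and reports itself as either
        fresh or wrongly-marked-down. Recent reports matter more."""
        observed = 1.0 if wrongly_marked_down else 0.0
        self.laggy_probability = (self.decay * self.laggy_probability
                                  + (1.0 - self.decay) * observed)
        if wrongly_marked_down:
            self.laggy_interval = (self.decay * self.laggy_interval
                                   + (1.0 - self.decay) * down_duration)


def adjusted_heartbeat_grace(heartbeat_grace, osd, group):
    """Step 3: widen the grace period using the OSD's own history and the
    aggregate history of the group it belongs to."""
    grace = heartbeat_grace
    if osd.laggy_probability > 0:
        grace += osd.laggy_interval * (1.0 / osd.laggy_probability)
    if group.laggy_probability > 0:
        grace += group.laggy_interval * (1.0 / group.laggy_probability)
    return grace


def should_mark_down(last_heartbeat, heartbeat_grace, osd, group, now=None):
    """Step 4, roughly: only mark the OSD down once the *adjusted* grace
    has expired (failure cancellations are not modeled here)."""
    now = now if now is not None else time.time()
    return (now - last_heartbeat) > adjusted_heartbeat_grace(heartbeat_grace, osd, group)
```

The guards against a zero laggy probability are my addition: an OSD or group with no laggy history simply gets the base grace rather than a divide-by-zero.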
As I mentioned above, I'm concerned this is addressing symptoms, rather than root causes. I'm concerned the root cause has something to do with how the map processing work scales with the number of OSDs/PGs, and that this will limit the maximum size of a Ceph storage cluster.

But, if you really just want to avoid marking down an OSD that is merely laggy, I know this will sound simplistic, but I keep thinking that the OSD knows for itself that it's up, even when the heartbeat mechanism is backed up. Couldn't there be some way to ask an OSD suspected of being down whether it is or not, separate from the heartbeat mechanism? I mean, if you're considering having the monitor ignore OSD down reports for a while based on some estimate of past behavior, wouldn't it be better for the monitor to just ask such an OSD, "hey, are you still there?" If it gets an immediate "I'm busy, come back later", extend the grace period; otherwise, mark the OSD down.

Or, maybe have a multicast group that OSDs periodically announce on - anyone considering marking an OSD down would look for a recent "I'm alive!" announcement from the OSD in question, and extend the heartbeat grace period if it saw one.

-- Jim
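A rough sketch of the direct "are you still there?" probe suggested above, to show the shape of the idea. The probe message, port, timeout, and reply strings are made up for illustration; Ceph's real messenger and wire protocol are not represented here.

```python
import socket

# Hypothetical monitor-side liveness probe, separate from the heartbeat path.
# Message strings, port, and timeout are invented for this sketch.

def probe_osd(addr, port, timeout=1.0):
    """Return 'alive', 'busy', or 'no-answer' for the OSD at addr:port."""
    try:
        with socket.create_connection((addr, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"ARE_YOU_THERE\n")
            reply = sock.recv(64)
    except OSError:
        return "no-answer"
    if reply.startswith(b"BUSY"):
        return "busy"          # alive but backed up: a candidate for a longer grace
    return "alive" if reply else "no-answer"


def decide(addr, port, base_grace, extension):
    """Policy sketch: extend the grace period for a busy-but-alive OSD,
    otherwise fall through to marking it down."""
    status = probe_osd(addr, port)
    if status in ("alive", "busy"):
        return ("extend-grace", base_grace + extension)
    return ("mark-down", None)
```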
> -Greg
>
> [1]: And we are doing a lot of work to reduce memory consumption, but while that can delay the problem it can't fix it.