As Ceph gets deployed on larger clusters, our most common scaling issues have related to 1) our heartbeat system, and 2) handling the larger numbers of OSDMaps that get generated as the OSD count (failures, boots, etc.) and PG count (osd up-thrus, pg_temp insertions, etc.) increase.

Lately we haven't had many issues with heartbeats when the OSDs are happy, so it looks like the latest respin of the heartbeat code is going to satisfy us going forward. Fast OSD map generation continues to be a concern, but it has become significantly less expensive with the recent merge of Sam's new map handling code (which reduced the amount of disk effort required to process a map and shuffled responsibility out of the main OSD thread and into the more highly-threaded PGs). We also have a number of implemented and planned changes (from the short- to the long-term) to continue making it less painful.

However, we've started seeing a new issue at the intersection of these separate problems: what happens when an OSD slows down because it's processing too many maps, but continues to operate. In large clusters, an OSD might go down and come back up with hundreds to thousands of maps to process, often at the same time as other OSDs. We've started to observe issues during software upgrades where a lot of OSDs come up together and process so many maps that they run out of memory and start swapping[1]. This can easily cause them to miss heartbeats long enough to get marked down, but then they finish map processing, tell the monitors they *are* alive, and get marked back up. This sequence can cause so many new maps to be generated that it repeats itself on the new nodes, spreads to other nodes in the cluster, or even causes some running OSD daemons to get marked out. We've taken to calling this "OSD thrashing".

It would be great if we could come up with a systemic way to reduce thrashing, independent of our efforts to reduce the triggering factors.
(For one thing, when only one node is thrashing we probably want to mark it down to preserve performance, whereas when half the cluster is thrashing we want to keep them up to reduce cluster-wide load increases.) A few weeks ago some of us at Inktank had a meeting to discuss the issue, and I've finally gotten around to writing it up in this email so that we can ask for input from the wider community!

We discussed several approaches, including scaling heartbeat intervals as more nodes are marked down or as nodes report being wrongly marked down, putting caps on the number of nodes that can be automatically marked down and/or out, and applying rate limiters to the auto-marks. In the end we realized that what we really wanted was to do our best to estimate the chance that an OSD which missed its heartbeat window was simply laggy rather than down. While long-term I'm a proponent of pushing most of this heartbeat handling logic to the OSDs, in the short term adjustments to the algorithm are much easier to implement in the monitor (which already has a lot more cluster state available locally).

We came up with a broad algorithm to estimate the chance that an OSD is laggy instead of down: first, figure out the probability that the OSD is down based on its past history, and then figure out that probability for the cluster that the OSD belongs to. Basically:

1) Keep track of when an OSD boots, and whether it reports itself as fresh or as wrongly marked down. From that data, maintain the probability that the OSD is actually down versus laggy, using an exponential decay (more recent reports matter more), and maintain the length of time the OSD was laggy in those cases.
2) When a sufficient number of failure reports come in to mark an OSD down, additionally compute the laggy probability and laggy interval for the reporters in aggregate.
3) Adjust the "heartbeat grace" locally on the monitor according to the following formula:

    adjusted_heartbeat_grace = heartbeat_grace
        + laggy_interval * (1 / laggy_probability)
        + group_laggy_interval * (1 / group_laggy_probability)

4) If we reach the end of that adjusted heartbeat grace and we have not received failure cancellations, then mark the OSD down. (Cancellations already exist: when an OSD gets a heartbeat from a node it has reported down but which hasn't been marked down, the OSD sends a cancellation.)
5) When running the out check, adjust the "down to out interval" by the same ratio we've adjusted the heartbeat grace by.

This algorithm has several nice properties:

1) It allows us to independently account for both the probability that the node is laggy, and for the length of time the node is usually laggy.
2) It localizes lagginess by PG relationships: if your Ceph cluster has multiple pools stored in different locations, lagginess won't cross those boundaries.
3) It's not too expensive, and by framing it the way we have (in terms of estimating probabilities) we can shuffle the generic algorithm around (e.g., eventually move these calculations to the reporting OSDs).

There are a couple of things it doesn't do:

1) It doesn't do a good job of noticing that a particular rack is laggy compared to other racks within the same pool.
2) It's all continuous: there isn't yet any sense of "don't guess anybody is laggy until we've seen a certain amount of churn over the last x minutes".

We think that this is a good start and that any necessary modifications will be pretty easy to add, but if you have other ideas or critiques we'd love to hear about them!
-Greg

[1]: And we are doing a lot of work to reduce memory consumption, but while that can delay the problem it can't fix it.
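
For concreteness, steps 1, 2, 3 and 5 above could be sketched roughly as follows. This is only an illustration in Python, not monitor code: the LagHistory class, the function names, the `weight` smoothing factor, the plain mean used to aggregate reporters, and the handling of a zero probability are all assumptions made for the sketch. The grace formula itself is implemented exactly as written in step 3.

```python
from dataclasses import dataclass


@dataclass
class LagHistory:
    """Per-OSD lag estimates (hypothetical structure for this sketch)."""
    laggy_probability: float = 0.0  # decayed chance a mark-down was really lag
    laggy_interval: float = 0.0     # decayed length of past lag, in seconds

    def record_boot(self, wrongly_marked_down, down_for, weight=0.3):
        """Step 1: fold one boot report into the running estimates.

        `weight` is an assumed smoothing factor; each new report decays
        the older history, so recent reports matter more.
        """
        sample = 1.0 if wrongly_marked_down else 0.0
        self.laggy_probability = (
            (1.0 - weight) * self.laggy_probability + weight * sample
        )
        # The interval is only updated when the OSD was laggy, not down.
        if wrongly_marked_down:
            self.laggy_interval = (
                (1.0 - weight) * self.laggy_interval + weight * down_for
            )


def aggregate(reporters):
    """Step 2: combine the reporters' histories (assumed: a plain mean)."""
    if not reporters:
        return LagHistory()
    n = len(reporters)
    return LagHistory(
        laggy_probability=sum(r.laggy_probability for r in reporters) / n,
        laggy_interval=sum(r.laggy_interval for r in reporters) / n,
    )


def lag_allowance(interval, probability):
    """One term of the step 3 formula: interval * (1 / probability).

    Assumption: with no lag history (probability == 0) the term is
    skipped instead of dividing by zero.
    """
    if probability <= 0.0:
        return 0.0
    return interval * (1.0 / probability)


def adjusted_heartbeat_grace(heartbeat_grace, osd, group):
    """Step 3: base grace plus the per-OSD and per-group terms."""
    return (
        heartbeat_grace
        + lag_allowance(osd.laggy_interval, osd.laggy_probability)
        + lag_allowance(group.laggy_interval, group.laggy_probability)
    )


def adjusted_down_out_interval(down_out_interval, heartbeat_grace, adjusted_grace):
    """Step 5: stretch the down-to-out interval by the same ratio."""
    return down_out_interval * (adjusted_grace / heartbeat_grace)


# Example: one OSD with some lag history, reported by two peers.
osd = LagHistory(laggy_probability=0.5, laggy_interval=10.0)
group = aggregate([LagHistory(0.5, 10.0), LagHistory()])
grace = adjusted_heartbeat_grace(20.0, osd, group)
out_interval = adjusted_down_out_interval(300.0, 20.0, grace)
```

Note that with the reciprocal as written, each term grows as its probability shrinks, so the zero-probability guard here is a placeholder choice rather than part of the proposal.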