Hi Greg,

Thanks for the write-up. I have a couple of questions below.

On 07/30/2012 12:46 PM, Gregory Farnum wrote:
> As Ceph gets deployed on larger clusters, our most common scaling issues have related to 1) our heartbeat system, and 2) handling the larger numbers of OSDMaps that get generated by increases in the OSD count (failures, boots, etc) and PG count (osd up-thrus, pg_temp insertions, etc). Lately we haven't had many issues with heartbeats when the OSDs are happy, so it looks like the latest respin of the heartbeat code is going to satisfy us going forward. Fast OSDMap generation continues to be a concern, but with the recent merge of Sam's new map handling code (which reduced the amount of disk effort required to process a map and shuffled responsibility out of the main OSD thread and into the more highly-threaded PGs) it has become significantly less expensive, and we have a number of implemented and planned changes (from the short- to the long-term) to continue making it less painful.
>
> However, we've started seeing a new issue at the intersection of these separate problems: what happens when an OSD slows down because it's processing too many maps, but continues to operate. In large clusters, an OSD might go down and come back up with hundreds to thousands of maps to process -- often at the same time as other OSDs. We've started to observe issues during software upgrades where a lot of OSDs come up together and process so many maps that they run out of memory and start swapping[1]. This can easily cause them to miss heartbeats long enough to get marked down -- but then they finish map processing, tell the monitors they *are* alive, and get marked back up. This sequence can generate so many new maps that it repeats itself on the new nodes, spreads to other nodes in the cluster, or even causes some running OSD daemons to get marked out. We've taken to calling this "OSD thrashing".
>
> It would be great if we could come up with a systemic way to reduce thrashing, independent of our efforts to reduce the triggering factors. (For one thing, when only one node is thrashing we probably want to mark it down to preserve performance, whereas when half the cluster is thrashing we want to keep them up to reduce cluster-wide load increases.)
>
> A few weeks ago some of us at Inktank had a meeting to discuss the issue, and I've finally gotten around to writing it up in this email so that we can ask for input from the wider community! After discussing several approaches (including scaling heartbeat intervals as more nodes are marked down or report being wrongly marked down, putting caps on the number of nodes that can be auto-marked down and/or out, applying rate limiters to the auto-marks, etc), we realized that what we really wanted was to do our best to estimate the chances that an OSD which missed its heartbeat window was simply laggy rather than down.
I don't understand the functional difference between an OSD that is too busy to process its heartbeat in a timely fashion, and one that is down. In either case, it cannot meet its obligations to its peers. I understand that wrongly marking an OSD down adds unnecessary map processing work. Also, if an OSD is wrongly marked down, then any data that would have been written to it while it is marked down will be written to other OSDs, and will need to be migrated when that OSD is marked back up.

I don't fully understand the impact of not marking down an OSD that really is dead, particularly if the cluster is under a heavy write load from many clients. At the very least, write requests that have a replica on such an OSD will stall waiting for an ack that will never come, or for a new map, right?

It seems to me that each of the discarded solutions has similar properties to the favored solution: they address a symptom, rather than the cause. Above you mentioned that you are seeing these issues as you scale out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred solution handles this issue perfectly on the biggest cluster anyone has built today. What do you predict will happen when that cluster size is scaled up by a factor of 2, or 10, or 100?
> While long-term I'm a proponent of pushing most of this heartbeat handling logic to the OSDs, in the short term adjustments to the algorithm are much easier to implement in the monitor (which already has a lot more cluster state locally). We came up with a broad algorithm to estimate the chance that an OSD is laggy instead of down: first, figure out the probability that the OSD is down based on its own past history, and then figure out that probability for the cluster that the OSD belongs to. Basically:
>
> 1) When an OSD boots, keep track of whether it reports itself as fresh or as wrongly-marked-down. Maintain the probability that the OSD is actually down versus laggy based on that data and an exponential decay (more recent reports matter more), and maintain the length of time the OSD was laggy for in those cases.
> 2) When a sufficient number of failure reports come in to mark an OSD down, additionally compute the laggy probability and laggy interval for the reporters in aggregate.
> 3) Adjust the "heartbeat grace" locally on the monitor according to the following formula:
>
>    adjusted_heartbeat_grace = heartbeat_grace
>                               + laggy_interval * (1 / laggy_probability)
>                               + group_laggy_interval * (1 / group_laggy_probability)
>
> 4) If we reach the end of that adjusted heartbeat grace, and we have not received failure cancellations (which already exist; when an OSD gets a heartbeat from a node it has reported down but which isn't marked down, the OSD sends a cancellation), then mark the OSD down.
> 5) When running the out check, adjust the "down to out interval" by the same ratio we've adjusted the heartbeat grace by.
>
> This algorithm has several nice properties:
>
> 1) It allows us to independently account for both the probability that the node is laggy, and for the length of time the node is usually laggy for.
This implies to me you think the root cause of lagginess is independent of client offered load. Otherwise, if client offered load does impact lagginess, then your estimate of the probability that an OSD is laggy is only useful for as long as your offered load doesn't change, no?
> 2) It localizes lagginess by PG relationships -- if your Ceph cluster has multiple pools stored in different locations, lagginess won't cross those boundaries.
> 3) It's not too expensive, and by framing it the way we have (in terms of estimating probabilities) we can shuffle the generic algorithm around (eg, eventually move these calculations to the reporting OSDs).
>
> There are a couple of things it doesn't do:
>
> 1) It doesn't do a good job of noticing that a particular rack is laggy compared to other racks within the same pool.
> 2) It's all continuous -- there isn't yet any sense of "don't guess anybody is laggy until we've seen a certain amount of churn over the last x minutes".
>
> We think that this is a good start and that any necessary modifications will be pretty easy to add, but if you have other ideas or critiques we'd love to hear about them!
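To make the bookkeeping in steps 1-3 above concrete, here is a minimal Python sketch of how a monitor might track per-OSD laggy statistics and compute the adjusted grace. The class and helper names (LaggyStats, record_boot, adjusted_heartbeat_grace) and the exponential-decay weighting are illustrative assumptions, not the actual ceph-mon code; the grace computation just follows the formula quoted in step 3.

```python
import time

# Illustrative sketch only -- names and structure are assumptions, not the
# actual ceph-mon implementation. The grace adjustment follows step 3 above.

class LaggyStats:
    """Tracks, for one OSD (or one group of OSDs), an exponentially decayed
    estimate of how often a 'failure' was really just lagginess, and how
    long the laggy episodes typically lasted."""

    def __init__(self, decay=0.7):
        self.decay = decay              # weight given to history vs. the newest report
        self.laggy_probability = 0.0    # estimated P(marked down but actually laggy)
        self.laggy_interval = 0.0       # typical length of a laggy episode (seconds)

    def record_boot(self, wrongly_marked_down, down_duration):
        """Step 1: called when an OSD boots and reports itself as either
        fresh or wrongly-marked-down. Recent reports matter more."""
        observed = 1.0 if wrongly_marked_down else 0.0
        self.laggy_probability = (self.decay * self.laggy_probability
                                  + (1.0 - self.decay) * observed)
        if wrongly_marked_down:
            self.laggy_interval = (self.decay * self.laggy_interval
                                   + (1.0 - self.decay) * down_duration)


def adjusted_heartbeat_grace(heartbeat_grace, osd, group):
    """Step 3: widen the grace period using the OSD's own history and the
    aggregate history of the group it belongs to."""
    grace = heartbeat_grace
    if osd.laggy_probability > 0:
        grace += osd.laggy_interval * (1.0 / osd.laggy_probability)
    if group.laggy_probability > 0:
        grace += group.laggy_interval * (1.0 / group.laggy_probability)
    return grace


def should_mark_down(last_heartbeat, heartbeat_grace, osd, group, now=None):
    """Step 4, roughly: only mark the OSD down once the *adjusted* grace
    has expired (failure cancellations are not modeled here)."""
    now = now if now is not None else time.time()
    return (now - last_heartbeat) > adjusted_heartbeat_grace(heartbeat_grace, osd, group)
```

The guards against a zero laggy probability are my addition: an OSD or group with no laggy history simply gets the base grace rather than a divide-by-zero.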
As I mentioned above, I'm concerned this is addressing symptoms, rather than root causes. I'm concerned the root cause has something to do with how the map processing work scales with the number of OSDs/PGs, and that this will limit the maximum size of a Ceph storage cluster.

But, if you really just want to avoid marking down an OSD that is merely laggy, I know this will sound simplistic, but I keep thinking that the OSD knows for itself that it's up, even when the heartbeat mechanism is backed up. Couldn't there be some way to ask an OSD suspected of being down whether it is or not, separate from the heartbeat mechanism? I mean, if you're considering having the monitor ignore OSD down reports for a while based on some estimate of past behavior, wouldn't it be better for the monitor to just ask such an OSD, "hey, are you still there?" If it gets an immediate "I'm busy, come back later", extend the grace period; otherwise, mark the OSD down.

Or, maybe have a multicast group that OSDs periodically announce on - anyone considering marking an OSD down would look for a recent "I'm alive!" announcement from the OSD in question, and extend the heartbeat grace period if it saw one.

-- Jim
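A rough sketch of the direct "are you still there?" probe suggested above, to show the shape of the idea. The probe message, port, timeout, and reply strings are made up for illustration; Ceph's real messenger and wire protocol are not represented here.

```python
import socket

# Hypothetical monitor-side liveness probe, separate from the heartbeat path.
# Message strings, port, and timeout are invented for this sketch.

def probe_osd(addr, port, timeout=1.0):
    """Return 'alive', 'busy', or 'no-answer' for the OSD at addr:port."""
    try:
        with socket.create_connection((addr, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"ARE_YOU_THERE\n")
            reply = sock.recv(64)
    except OSError:
        return "no-answer"
    if reply.startswith(b"BUSY"):
        return "busy"          # alive but backed up: a candidate for a longer grace
    return "alive" if reply else "no-answer"


def decide(addr, port, base_grace, extension):
    """Policy sketch: extend the grace period for a busy-but-alive OSD,
    otherwise fall through to marking it down."""
    status = probe_osd(addr, port)
    if status in ("alive", "busy"):
        return ("extend-grace", base_grace + extension)
    return ("mark-down", None)
```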
> -Greg
>
> [1]: And we are doing a lot of work to reduce memory consumption, but while that can delay the problem it can't fix it.