Re: Yet another meltdown starting

For everyone who does not want to read the details below: I am now running with (dramatically?) increased beacon grace periods for OSD (3600s) and MGR (90s) beacons, and I am wondering what the downsides of this are and whether there are better tuning parameters for my issues.
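
For reference, a minimal sketch of how such values can be set at runtime, assuming the options in question are mon_mgr_beacon_grace (MGR beacon grace) and mon_osd_report_timeout (how long the MONs wait for OSD reports) -- that is my reading of "beacon grace periods", please verify the names against your release before applying anything:

    # MGR beacon grace on the MONs, here 90 seconds
    ceph config set mon mon_mgr_beacon_grace 90
    # Grace for OSD reports to the MONs, here 3600 seconds
    ceph config set mon mon_osd_report_timeout 3600
    # Check what is currently active
    ceph config get mon mon_mgr_beacon_grace
    ceph config get mon mon_osd_report_timeout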

---

Hi Lenz,

I'm wondering about this as well. I have been following this and other MGR threads about dashboard crashes with great interest.

It is not exactly the same issue; we are still on mimic and, under normal circumstances, all queues are empty. However, I also have the feeling we are hitting a very specific piece of code that has a comparatively large execution time for little input. I was actually surprised to read that Python code is involved in processing high-frequency events in a non-distributed way.

It is said that an MGR is not a single point of failure, but this does not seem to be true in the full sense. If some workload is not distributed but processed by only one instance (in an active-passive way), then

- it does not scale,
- it becomes an effective single point of failure, as every instance suffers from the same restriction, and
- fail-over will not help, as we see a failure of a healthy instance due to load (see the quick check sketched after this list).
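
As a quick illustration of that active-passive layout (standard commands only, nothing specific to this incident), the active MGR and the standbys can be listed with:

    # Shows which MGR is currently active
    ceph mgr stat
    # Full MGR map, including the standby daemons
    ceph mgr dump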

This seems to be exactly what I'm observing. What started was a loop of doom:

- the active MGR fails to send its beacon in time,
- the MONs mark the MGR out and elect the next available standby,
- the new MGR takes over, is hit by the same problem and does not send its beacon in time either,
- meanwhile, the previous MGR reconnects, but more slowly than the next one is marked out.

After a while, the MONs were cyclically kicking out the active MGR. Each MGR stayed active only for the beacon grace period and was then thrown out. Note that none of the MGR processes crashed or died; everything was up and running. What I observed was a client-load-induced evict-reconnect cycle.
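
For anyone who wants to watch such a cycle from the outside, a rough sketch (the polling interval is arbitrary) that simply records which MGR is active over time:

    # Log the active MGR every 10 seconds; during the evict-reconnect cycle
    # the active daemon changes roughly once per beacon grace period.
    while true; do
        date
        ceph mgr stat
        sleep 10
    done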

What is suspicious is that I cannot see any significant increase in load or network traffic on the MGR node during the critical time before the incident (there was, of course, a huge increase in client traffic, up to the limit of the hardware). It looks like something completely under the radar: a tiny bit of very specific processing has a huge impact under high client load, like the difference between compiled and interpreted code in the issue you mentioned.

There also seem to be cumulative issues like memory leaks. I observe regular crashes of the dashboard, and the dashboard creates quite a large load for doing very little as well.

I will put this case on the list of references for a future thread "Cluster outage due to client IO" I'm preparing. All the major issues I have been observing lately have to do specifically with beacons sent to the MONs. I did not see any heartbeat failures. The cluster was physically healthy (everything up and running) but not logically healthy (the MONs did not get the required information in time), and increasing the beacon grace periods immediately restored logical cluster health. It looks like beacons are processed in a different way than heartbeats and that there is a critical bottleneck somewhere.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Lenz Grimmer <lgrimmer@xxxxxxxx>
Sent: 11 May 2020 17:50:34
To: ceph-users@xxxxxxx
Subject:  Re: Yet another meltdown starting

Hi Frank,

On 5/11/20 3:03 PM, Frank Schilder wrote:

> OK, the command finally executed and it looks like the cluster is
> running stable for now. However, I'm afraid that 90s might not be
> sustainable.
>
> Questions: Can I leave the beacon_grace at 90s? Is there a better
> parameter to set? Why is the MGR getting overloaded on a rather small
> cluster with 160 OSDs? How does this scale?

I wonder if https://tracker.ceph.com/issues/45439 might be related to
what you're observing here?

In this issue, Andras suggests: "Increasing mgr_stats_period to 15
seconds reduces the load and brings ceph-mgr back to responsive again."
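
For completeness, a sketch of how that suggestion could be applied, assuming mgr_stats_period is the option name referenced in the tracker issue:

    # Raise the MGR stats period from the default 5 seconds to 15 seconds
    ceph config set mgr mgr_stats_period 15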

Maybe that helps?

Lenz

--
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



