Cluster outage due to client IO?

Dear fellow cephers,

we have a nasty and, at the same time, mysterious problem with our Mimic 13.2.8 cluster. The most prominent observations are:

- benign operations like adding disks sometimes lead to timeouts and OSDs being marked down for no apparent reason, while at other times everything is normal (see https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/VO7FGPDKRDUCENTWR3YDF5Y6KSIQXMLB/)
- out of the blue, OSDs or other daemons get marked down for a period of time, after which everything suddenly goes back to normal (see https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/QVQTAL6ZK6SII2PSFXUNWPSOSHL5C6WO/)

All this happens while everything is physically healthy: no network problems, no disk problems, nothing. No daemon actually crashes and the logs do not contain any trace of a malfunction.

At first I thought that some clients overloaded the cluster, but now I'm leaning towards a very rarely hit bug in the MON daemons. My current hypothesis is that a MON daemon can enter an ill state without notice: there is nothing in the logs and the peers don't notice anything either. An affected MON daemon keeps running and continues to communicate with everything else, yet every now and then it goes bonkers for a while, taking the entire cluster out of service. A single ill MON is enough for this to happen, making this a rare, high-impact problem.

Please find below a collection of observations and attempts to address the periodic service outages we experienced. Unfortunately, the observations are somewhat inconclusive and I hope someone here recognizes a smoking gun. The information was collected over a period of 6 weeks.

I tried to organize the information below by relevance/significance, but I also aim to provide a complete record, just in case. The structure is as follows:

  1. Description of incident
  2. Actions taken to get the cluster under control
  3. Potential (temporary) resolution
  4. Observations collected during various occurrences

==========================
1. Description of incident

After setting mon_osd_report_timeout and mon_mgr_beacon_grace to 24h, I managed to isolate a signal in our monitoring data that clearly marks the window of an incident; see https://imgur.com/a/cg9Yf4E . The images show the network packet rates collected with ganglia on the server and client side. The blue curve is packets out and the green curve is packets in. The incident started at ca. 10:15 and is clearly visible as a plateau. Note that another incident had ended shortly before, around 9:30; its tail end is also visible.
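
For reference, this is roughly how I raised the two grace periods. A minimal Python sketch, assuming the standard ceph CLI is available on an admin node and that injecting the options into the running MONs is acceptable (the config database via "ceph config set" should work as well on Mimic):

    import subprocess

    # Raise the grace periods to 24h (86400s) so that the MONs stop marking
    # daemons down during an incident window. The value is just what I used.
    for opt in ("mon_osd_report_timeout", "mon_mgr_beacon_grace"):
        subprocess.run(
            ["ceph", "tell", "mon.*", "injectargs", "--{}=86400".format(opt)],
            check=True,
        )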

During an incident, typically two MONs start sending an insane number of packets out, up to 40000 pkts/s. For comparison, normal operation is only about 200 pkts/s. During an incident, the leader stops processing OSD and MGR beacons, but happily marks everything down. I don't remember any other operations getting stuck; in particular, MDS beacons were not affected. The CPU load of the MONs sending packets is slightly above 100%. Memory consumption is normal.
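
The packet-rate plateau is easy to spot once you sample it. Here is a hypothetical little watcher in Python (the interface name and threshold are placeholders, not from our setup); it reads /proc/net/dev on a MON host and prints a warning when the outgoing packet rate jumps far above the normal ~200 pkts/s:

    import time

    IFACE = "eth0"              # placeholder interface name on the MON host
    ALERT_PKTS_PER_S = 10000    # normal operation here is around 200 pkts/s out
    INTERVAL = 10               # sampling interval in seconds

    def tx_packets(iface):
        # /proc/net/dev: after "<iface>:" the fields are RX bytes, packets, ...
        # followed by TX bytes, packets, ...; index 9 is the TX packet counter.
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    return int(line.split(":", 1)[1].split()[9])
        raise ValueError("interface not found: " + iface)

    prev = tx_packets(IFACE)
    while True:
        time.sleep(INTERVAL)
        cur = tx_packets(IFACE)
        rate = (cur - prev) / float(INTERVAL)
        prev = cur
        if rate > ALERT_PKTS_PER_S:
            print("possible incident: %.0f pkts/s out on %s" % (rate, IFACE))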

It looks like the packets are sent to a subset of ceph fs clients. I'm not sure whether this is connected to actual I/O the clients are doing; I have seen incidents in both cases, with heavy and with absolutely no I/O load on the file system. The images show an incident with no correlated I/O going on. However, it is remarkable that all affected clients were running a specific piece of software belonging to a specific user, so clients seem to play a role, although it looks unlikely that ordinary file operations are causing this.

During an incident window, one has very little time to react. During the first 4-5 minutes the cluster still responds to admin commands like setting OSD flags. As soon as the MONs start reporting slow ops, everything gets stuck and one can only wait for it to be over. It's a complete loss of control. Typically, OSDs and MGRs get marked down en masse and service is lost without a chance to do anything. The cluster is physically healthy the entire time; I did not observe any actual crashes or spontaneous restarts.

The incident is not (easily?) reproducible. I observed two software packages running while incidents happened; however, the same software was also running without any issues at other times. The frequency of incident windows was 1-2 per day during the worst periods. It looks like certain client software (operations) increases the chance of this happening, but does not always trigger it.

I collected a lot of information during a one-week period with several incidents, such as:

- MON and OSD logs
- full aggregated system log
- dumps of ops and historic ops during incidents
- tcpdumps in both promiscuous and non-promiscuous mode
- a full tcpdump of 10000 packets (ca. 0.25s) on the leader during an incident window
- more graphs with other data
- results of experiments with different actions

So far I haven't found anything, probably because I don't know what to look for. I'm happy to make any of this information available if it helps.
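
For what it's worth, this is roughly how the per-incident snapshots were taken; a sketch only, assuming it runs on the MON host, the admin socket is accessible, and tcpdump is installed (the mon id is a placeholder):

    import datetime
    import subprocess

    MON_ID = "ceph-01"    # placeholder: the id of the local MON
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

    # In-flight ops on the MON, taken via the admin socket.
    with open("mon-ops-%s.json" % stamp, "w") as out:
        subprocess.run(["ceph", "daemon", "mon." + MON_ID, "ops"],
                       stdout=out, check=True)

    # A bounded packet capture; 10000 packets took only ~0.25s during an
    # incident. -p keeps the interface out of promiscuous mode.
    subprocess.run(["tcpdump", "-p", "-c", "10000",
                    "-w", "mon-%s.pcap" % stamp],
                   check=True)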

==========================
2. Actions taken to get the cluster under control

The first incident, the long ping times (https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/VO7FGPDKRDUCENTWR3YDF5Y6KSIQXMLB/), which I believe is related, happened while I added a small number of OSDs. The problem went away by itself and I didn't observe it again when adding more OSDs later.

I discovered the second incident, the total cluster outage (https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/QVQTAL6ZK6SII2PSFXUNWPSOSHL5C6WO/), very shortly after it started; by coincidence, I was looking at the dashboard at the right moment. However, I got to the command line too late: about 2/3 of all OSDs were already marked down. The cluster still responded to admin commands and I managed to set nodown and noout, which helped a little. I got a very important hint in the ceph-users thread, and after setting mon_osd_report_timeout to 1h the cluster became healthy immediately. I'm not entirely sure, though, whether this was due to the timeout setting or the incident stopping by itself, with the command returning only after the incident was over - it might have been coincidence.
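
For completeness, the damage control boiled down to the following; a minimal sketch of the commands I ran (the plain CLI works just as well, of course):

    import subprocess

    def ceph(*args):
        subprocess.run(("ceph",) + args, check=True)

    # Stop the MONs from marking further OSDs down/out while investigating.
    ceph("osd", "set", "nodown")
    ceph("osd", "set", "noout")

    # Give OSD beacons a much longer grace period (1h, as described above).
    ceph("tell", "mon.*", "injectargs", "--mon_osd_report_timeout=3600")

The flags need to be unset again (ceph osd unset nodown / ceph osd unset noout) once the incident is over.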

Thereafter things gradually got worse and worse. The incident windows became longer and occurred more and more often. Instead of OSDs being marked down, it now hit the MGRs. I therefore also increased mon_mgr_beacon_grace to larger and larger values to get the cluster stable and to have a chance to debug and collect information.

At some point the incident windows exceeded 1h and OSDs got affected again. We had a couple more total outages, and eventually I set the timeouts to 24h, which would allow me to check just twice a day. The longest incident window I observed was something like 1h 15m.

After this, I tried to catch incidents and experimented with different actions to see what would happen. I potentially discovered a way to resolve the problem, at least temporarily; see the next section. Results of other experiments are listed thereafter.

==========================
3. Potential (temporary) resolution

Restarting the "ill" monitor resolves the issue. I don't know if this issue arises at MON startup or over time under certain conditions. Therefore, I cannot say if a clean restart resolves it permanently or if the monitor can fall ill again. I still have the 24h beacon timeouts just to be sure.

In my case, the leader was probably the culprit. Since restarting the leader about 3 weeks ago, I have not seen a single incident.

A possible clue for whether or not a cluster is affected could be the distribution of client connections the monitors hold. I observed an incredibly uneven distribution, something like ceph-01 (leader) 1500, ceph-02 70, ceph-03 300. After restarting ceph-02, nothing happened. After restarting ceph-01, the MONs immediately started rebalancing client connections and ended up with something like ceph-01 650, ceph-02 850, ceph-03 650. This converged over the last 3 weeks to around ceph-01 725, ceph-02 815, ceph-03 625 and now varies only a little.

It seems that a highly imbalanced distribution of client connections is an early indicator. I have been watching the distribution ever since.
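
The check itself is simple. A sketch of what I run on each MON host, assuming the admin socket's "sessions" command returns a JSON array with one entry per session (it does on our version) and that the mon id equals the short hostname:

    import json
    import socket
    import subprocess

    # Count the sessions held by the local MON.
    mon_id = socket.gethostname().split(".")[0]   # assumption: mon id = short hostname
    out = subprocess.run(["ceph", "daemon", "mon." + mon_id, "sessions"],
                         capture_output=True, text=True, check=True)
    print("%s: %d sessions" % (mon_id, len(json.loads(out.stdout))))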

Another early sign might be that the active MGR no longer likes to run on a specific host. Our MGR was always running on ceph-01 (together with the MON leader). At some point, however, I observed that it started failing over to ceph-02 for, again, no apparent reason. There was no actual MGR restart. Even if I forced the MGR on ceph-01 to be active, after a while a different MGR would become active again.

After restarting the "ill" MON, this also stopped.
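
To watch for this second early sign, one can poll which MGR is active. A small sketch, assuming "ceph mgr dump" reports the active daemon under "active_name" (it does here) and that polling once a minute is enough:

    import json
    import subprocess
    import time

    last = None
    while True:
        out = subprocess.run(["ceph", "mgr", "dump", "--format=json"],
                             capture_output=True, text=True, check=True)
        active = json.loads(out.stdout).get("active_name")
        if last is not None and active != last:
            print("active MGR moved: %s -> %s" % (last, active))
        last = active
        time.sleep(60)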

==========================
4. Observations collected during various occurrences

- setting nodown and noout in time will prevent the service outage
- if the OSDs remain up, client I/O is unaffected (throughput-wise); the incident does not add critical load
- restarting a MON with slow ops does not help, probably unless it is the right one; I never dared to restart the leader during an incident, though
- restarting MGRs makes things worse
- disabling the dashboard does not help (I suspected a bug in the dashboard for a while)
- during an incident, the general network load on the entire cluster increases; with tcpdump in promiscuous mode I could see a very large increase in packets between all (?, many?) servers; the network hardware was never challenged, though


I hope some of this makes sense to someone and helps to isolate the root cause.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14