One slow OSD, causing a dozen warnings

Hi all,

We experienced something strange and scary on our Ceph cluster yesterday. The cluster is now back to health. I want to share our experience here, and hopefully someone can help us find the root cause and prevent it from happening again.

TL;DR: An OSD became very slow for an unknown reason; the problem was resolved by restarting the OSD daemon. I then suspect the MGR and MONs were overloaded by the flood of slow-ops cluster log messages.

I noticed this about one hour after the cluster entered HEALTH_WARN. As far as I can remember, the warnings included MDS_TRIM, MDS_HEALTH_CLIENT_LATE_RELEASE, MDS_CLIENT_OLDEST_TID, MDS_SLOW_METADATA_IO, and SLOW_OPS.

There were about 20k slow ops spread over all OSDs. When inspecting them, I found that almost all of them were waiting for sub-ops from OSD.25, while ops on OSD.25 itself were queued for pg. That OSD was still working, but exceptionally slowly: its block device write IO wait time was 2s [1], and its OSD write op latency was over 1200s [2]. When all of this started, about 2000 write IOPS were hitting the cluster, but all other OSDs were working fine. For comparison, another OSD (located in the same host, with identical configuration) had a write IO wait time <10ms and a write op latency <50ms.
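
In case it is useful, this is the kind of inspection I mean; osd.25 is of course specific to our case, and the exact output fields may differ between releases:

    ceph health detail                      # summarizes which daemons currently have slow ops
    ceph daemon osd.25 dump_ops_in_flight   # run on the OSD's host; each op has a flag_point such as "waiting for sub ops" or "queued for pg"
    ceph daemon osd.25 dump_historic_ops    # recently completed ops with their durations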

Then I restarted the OSD.25 daemon. Most things went back to normal (so this does not look like a hardware issue), but why it ever got into this state is still a mystery. I can only find logs complaining about slow ops.
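
For reference, the restart itself was nothing special, just the usual command for whichever deployment type is in use, e.g.:

    systemctl restart ceph-osd@25       # package-based install, run on the OSD's host
    ceph orch daemon restart osd.25     # cephadm-managed cluster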

I was then left with about 2k slow ops spread over all 5 MONs, most of them logm requests. I thought this would resolve itself, but things got worse again: I started to get MON_DISK_BIG and MON_DISK_LOW. This was scary, because every MON's disk usage seemed to grow unbounded, and the cluster cannot work without the MONs. Then one of our hosts went OOM.
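
In case it helps others: MON_DISK_BIG fires when a MON's store grows past mon_data_size_warn, and MON_DISK_LOW when free space on a MON's disk falls below mon_data_avail_warn. Something like the following shows where each MON stands (the store path assumes the default data directory layout):

    du -sh /var/lib/ceph/mon/*/store.db       # actual store size, run on each MON host
    ceph config get mon mon_data_size_warn    # threshold behind MON_DISK_BIG (15 GiB by default)
    ceph tell mon.<id> compact                # ask one MON to compact its store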

As for the OOM, I suspect it was the MGR taking too much memory, because after the standby MGR took over, its memory usage also grew unbounded. I suspect the MGR leaks memory when processing the cluster log, because 30 minutes after the OSD slow ops were cleared, slow-ops log entries from more than 30 minutes earlier were still rolling in on the dashboard.

The MON cluster was also having a hard time. After I reset the OOM'd host, the MON daemon on that host took 30 minutes to rejoin the quorum. At almost the same time, the warnings about the MONs cleared. I don't know whether that was because I restarted another MON, or because they had just finished processing all the logs. MON disk usage also went back down to <200MB.

So how can we prevent this from happening again? For the slow OSD, I think the root cause will be hard to find, but I would appreciate any hints. I suspect Ceph may have an unbounded queue for the cluster log in the OSD, MON and/or MGR. If so, I think logs should be dropped when the upstream cannot handle them, rather than queued up indefinitely. Also, the MONs and MGRs should have some mechanism to protect themselves from being overloaded by logs.
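
I have not verified that tuning any of these would actually have helped, but as far as I can tell these are the knobs that control how much cluster log and slow-ops traffic reaches the MONs, in case someone wants to experiment (osd.25 below is just our example daemon):

    ceph config set global clog_to_monitors false      # daemons stop sending cluster log entries to the MONs (loses "ceph -w" and dashboard log output)
    ceph config set mon mon_cluster_log_to_file true   # keep the cluster log in a file on the MON hosts instead
    ceph config show osd.25 osd_op_complaint_time      # ops slower than this many seconds are reported as slow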

We are running Ceph 16.2.5 with ~30 OSDs, most of them HDDs with their DB on SSD.

Regards,
Weiwen Hu

[1]: PromQL: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m])
[2]: PromQL: rate(ceph_osd_op_w_latency_sum[1m]) / rate(ceph_osd_op_w_latency_count[1m])


