Hello,

I am writing about an incident that started last weekend. There seems to be something wrong with my e-mail and some of my messages did not reach the list, so I decided to start a new thread here and begin from the beginning. The earlier related thread is here: http://lists.ceph.com/pipermail/ceph-community-ceph.com/2018-September/000292.html

We have a cluster with 28 servers and 168 OSDs. The OSDs are BlueStore on NL-SAS (non-SMR) disks, with WAL+DB on NVMe. My distro is Arch Linux. Last weekend I upgraded from 12.2.4 to 13.2.1, and the cluster did not start because the OSDs were stuck in the booting state. Sage helped me with this (thanks!) by rebuilding the MON store.db from the OSDs via ceph-objectstore-tool. At first everything was perfect. However, two days later I had a most unfortunate accident: 7 of my servers crashed at the same time. When they came back up, the cluster was in HEALTH_ERR state. 2 of those servers were MONs (I have 3 in total).

I have been working for 3 days collecting data and testing, but I could not make any progress. First of all, I double-checked OS health, network health and disk health; they show no problems. My further findings are these:

I have an rbd pool with 33 TB of VM data. As soon as an OSD starts, it generates a lot of I/O on the BlueStore disks (NL-SAS). This makes the OSD nearly unresponsive; you can't even injectargs. The cluster does not settle. I left it alone for 24 hours, but the OSD up count dropped to ~50. The OSDs log lots of slow requests and lots of heartbeat messages, and eventually they are marked down.

Latest cluster status: https://paste.ubuntu.com/p/BhCHmVNZsX/
ceph.conf: https://paste.ubuntu.com/p/FtY9gfpncN/
Sample OSD log: https://paste.ubuntu.com/p/ZsqpcQVRsj/
MON log: https://paste.ubuntu.com/p/9T8QtMYZWT/
I/O utilization on disks: https://paste.ubuntu.com/p/mrCTKYpBZR/

So I think my problem is really weird. Somehow the pool cannot heal itself. The OSDs run at 95% disk I/O utilization and peering is far too slow. The OSD I/O still had not stopped after 72 hours. Because of the high I/O, the OSDs can't get answers from other OSDs and complain to the monitors. The monitors mark them down, but I can see the OSD processes still running. For example, "ceph -s" says 50 OSDs are up, yet I see 153 ceph-osd processes running in the background and trying to reach the other OSDs. So it is very confusing and certainly not progressing.

We're trying every possible strategy. We stopped all the OSDs and then started them one at a time, server by server: start one OSD, wait for its I/O to finish, then move on to the next OSD on the same server. We found that even when the first OSD's I/O had finished, starting the second OSD triggered it again. So when we started the sixth and final OSD on that server, the other five OSDs went to 95% I/O as well. The first OSD's I/O finished in 8 minutes, but the sixth OSD's I/O took 34 minutes! Then we moved on to the next server, and as soon as we started that server's OSDs, the previously finished OSDs started doing I/O again. So we gained nothing.

Now we are planning to set noup, then start all 168 OSDs, and then unset noup (a rough sketch of the commands we intend to run is at the end of this mail). Maybe this will stop the OSDs from repeating that I/O over and over again.

After 72 hours I believe we may have hit a bug. Any help would be greatly appreciated. We're on IRC 24/7.

Thanks to: Be:El, peetaur2, degreaser, Ti and IcePic.
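
P.S. For clarity, here is a minimal sketch of the noup plan mentioned above. It assumes the OSDs are managed by the usual systemd units (ceph-osd@ / ceph-osd.target); adjust the unit names to however your OSDs are actually started:

    # Prevent booting OSDs from being marked "up" while they read maps and peer
    ceph osd set noup

    # On each of the 28 servers, start all local OSDs
    # (assumes the standard ceph-osd.target systemd unit is in use)
    systemctl start ceph-osd.target

    # Watch until all 168 ceph-osd processes are running and disk I/O has calmed down
    ceph -s
    ceph osd stat

    # Then let them all come up together
    ceph osd unset noup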