Hi,

We had a strange issue while adding a new OSD to our Ceph Luminous 12.2.8 cluster. The cluster has >300 OSDs based on SSDs and NVMe. After adding the new OSD, one of the already running OSDs started giving slow request warnings. We checked that OSD and it was working properly: nothing strange in the logs, and it still showed disk activity. It looks like it simply stopped serving requests for one particular PG.
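In case it is useful, this is roughly how the per-PG picture can be checked (osd.X / X being a placeholder for the affected OSD id):

    # blocked requests on the OSD; the description of each op includes
    # the PG id, so you can see whether they all belong to the same PG
    ceph daemon osd.X dump_blocked_ops

    # PGs that have this OSD in their acting set
    ceph pg ls-by-osd X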
Requests were just piling up, and the number of slow requests kept growing until we restarted the OSD (all our OSDs run BlueStore). We have checked everything in our setup and it is all properly configured (this cluster has been running for more than 5 years and hosts several thousand VMs). Beyond finding the real source of the issue, what I would like to find is a way to protect the cluster from this kind of issue; I guess I'll have to add more OSDs, and if it happens again I can at least dump the stats of the offending OSD (ceph daemon osd.X dump_historic_slow_ops).
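For reference, this is more or less what I plan to capture next time it happens (again with osd.X as a placeholder):

    # ops currently in flight and recent ops that exceeded the slow threshold
    ceph daemon osd.X dump_ops_in_flight
    ceph daemon osd.X dump_historic_slow_ops

    # cluster-wide summary of which OSDs have blocked requests
    ceph health detail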
By protecting the cluster I mean the following: in some scenarios OSDs just commit suicide (actually I fixed this issue by simply restarting the offending OSD), but how can we deal with this kind of situation when the OSD keeps running and blocking requests? I've been looking around, but I could not find anything. Obviously we could set up our monitoring software to restart any OSD that has more than N slow requests, but I find that a little bit too aggressive (a rough sketch of what I mean is below).
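Just to illustrate what I mean by aggressive, the kind of watchdog I had in mind would look roughly like this. It is only a sketch: the threshold, the grep-based counting and the systemd unit name are assumptions on my part, not something we actually run.

    #!/bin/bash
    # Crude per-OSD watchdog: restart an OSD if it reports too many blocked ops.
    # Meant to run periodically on each OSD host; purely illustrative.
    THRESHOLD=50    # arbitrary limit of tolerated blocked ops

    for sock in /var/run/ceph/ceph-osd.*.asok; do
        [ -e "$sock" ] || continue
        id=${sock##*/ceph-osd.}; id=${id%.asok}

        # count the ops this OSD currently reports as blocked
        blocked=$(ceph daemon "$sock" dump_blocked_ops | grep -c '"description"')

        if [ "$blocked" -gt "$THRESHOLD" ]; then
            echo "osd.$id: $blocked blocked ops, restarting" | logger -t osd-watchdog
            systemctl restart "ceph-osd@$id"
        fi
    done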
Is there anything built into Ceph to deal with these situations? An OSD blocking requests in an RBD scenario is a big deal, as plenty of VMs will hit disk timeouts, which can lead to the VMs simply panicking.

Thanks!
Xavier