So, this is following on from a discussion in the #ceph IRC channel, where we seem to have reached the limit of what we can do.

I have a ~15-node, 311-OSD cluster (20 OSDs per node). The cluster is Nautilus: the 3 MONs and the first 8 OSD hosts were installed as Mimic and upgraded to Nautilus with ceph-ansible; the remaining OSD hosts were deployed directly on Nautilus, as they were only added a few weeks ago.

Yesterday, suddenly, about half of the OSDs (~140) were marked down, and a number of slow operations were detected. Examining the logs (and with a bit of help from IRC), I noticed that the ansible roles used to build the newer OSD hosts had configured chrony incorrectly, and their clocks were drifting. (There were BADAUTHORIZER errors in the OSD logs, too.) I fixed the chrony configuration... and we (including people in IRC) expected everything to just... stabilise.

Things have not stabilised, which leads me to suspect that there are other issues at play. After noticing a number of reported issues with mgrs deadlocking in Nautilus, e.g.

https://tracker.ceph.com/issues/17170
https://tracker.ceph.com/issues/43048

I tried stopping all mgrs and mons, and then slowly bringing them back up. This has not helped.

Interestingly, the OSDs with slow ops (some of which are marked down) report ops_in_flight that are in "wait for new map", whilst the lead mon believes those same ops have timed out. (I can, of course, telnet to every OSD, even the down ones, from other OSDs - including ones which report issues talking to them on that same port - and from the lead mon.)

I am wondering if this is an example of https://tracker.ceph.com/issues/44184, as we did create a new pool shortly after adding the new OSD host nodes. But it isn't clear from that ticket [or the discussion on this list] how to fix this, other than removing the pool - which I can't do, as we need this pool to exist, and the pool it replaces needs to be decommissioned.

Can anyone advise what I should do next? At present, obviously, the cluster is unusable. For completeness, the checks and commands I've been running are below.
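First, how I verified the clocks after fixing chrony. This is a sketch rather than an exact transcript - host names are whatever is in your inventory, and I'm assuming the stock chrony setup:

    # On each of the newer OSD hosts: confirm chrony is actually locked to
    # a source, and that the system offset is small (well under the 50ms
    # mon_clock_drift_allowed default).
    chronyc tracking
    chronyc sources -v

    # From an admin/mon node: the mons' own view of clock skew.
    ceph time-sync-status

All of these look clean to me now, which is why I believe the clock problem itself is fixed.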
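Second, the mon/mgr restart sequence I used. The unit names are the standard systemd targets shipped with the ceph packages; mon1/mon2/mon3 are placeholders for my actual mon hosts:

    # Stop all mgrs first, then all mons:
    systemctl stop ceph-mgr.target    # on every mgr host
    systemctl stop ceph-mon.target    # on every mon host

    # Bring the mons back one host at a time, waiting for quorum to
    # re-form between each start:
    systemctl start ceph-mon.target   # on mon1, then mon2, then mon3
    ceph quorum_status                # check quorum_names after each start

    # Once all three mons are back in quorum, start the mgrs again:
    systemctl start ceph-mgr.target   # on each mgr host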
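Finally, how I've been inspecting the stuck ops and map epochs. osd.42 stands in for any of the affected OSDs; the "ceph daemon" commands go through the admin socket, so they're run on that OSD's own host:

    # Dump in-flight ops; the stuck ones show "wait for new map" as their
    # most recent event:
    ceph daemon osd.42 dump_ops_in_flight

    # Compare the OSD's map range (oldest_map/newest_map) against the
    # cluster's current osdmap epoch:
    ceph daemon osd.42 status
    ceph osd stat

If newest_map on the affected OSDs were lagging far behind the cluster epoch, that would at least be consistent with the #44184 theory - but even if that's confirmed, I don't know what the safe way forward is.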