Hi all,

I have a smallish test cluster (14 servers, 84 OSDs, 6 OSDs per server on filestore) running 14.2.4. Monthly OS patching, and the reboots that go along with it, have left the cluster very unwell. Many of the servers are OOM-killing their ceph-osd processes as they try to start, and strace shows the ceph-osd processes spending hours reading through roughly 220k osdmap files after startup. This behavior started after we recently filled the cluster to about 72% to see how it behaved; we also upgraded it to Nautilus 14.2.2 at about the same time.

I've tried starting just one OSD per server at a time in hopes of avoiding the OOM killer. I've also tried setting noin, rebooting the whole cluster, waiting a day, and then marking each of the OSDs in manually.
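For reference, this is roughly the per-OSD sequence I've been using. The OSD id (12) and the systemd unit name are just examples for a standard systemd deployment; the "ceph daemon ... status" call is only something I added so I can watch oldest_map/newest_map catch up while the OSD chews through the osdmap backlog:

    ceph osd set noin                 # keep restarted OSDs from being marked in automatically
    systemctl start ceph-osd@12       # start a single OSD on this host
    ceph daemon osd.12 status         # run on the OSD host; oldest_map/newest_map show osdmap replay progress
    ceph osd in 12                    # once its state reaches "active", mark it in by hand
    ceph osd unset noin               # clear the flag after the last OSD is back in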
The end result is the same either way: about 60% of the PGs are still down, 30% are peering, and the rest are in worse shape. Does anyone out there have suggestions on how I should go about getting this cluster healthy again? Any ideas appreciated.

Thanks!
- Aaron