Sick Nautilus cluster, OOM killing OSDs, lots of osdmaps

Hi all

 

I have a smallish test cluster (14 servers, 84 OSDs) running 14.2.4.  Monthly OS patching, and the reboots that go along with it, have resulted in the cluster getting very unwell.

 

Many of the servers in the cluster are OOM-killing the ceph-osd processes when they try to start (6 OSDs per server, running on Filestore).  strace shows the ceph-osd processes spend hours reading through the ~220k osdmap files after being started.
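One way to see how many full maps each OSD is holding is to count the osdmap objects under the Filestore meta directory. This is a sketch assuming the default Filestore layout (`/var/lib/ceph/osd/ceph-<id>/current/meta/`, with object files named `osdmap.<epoch>__...`); here it is run against a simulated temp directory so the snippet is self-contained, but on a real host you would point OSD_DIR at the actual OSD data directory:

```shell
# Simulate a Filestore OSD data dir (on a real host: /var/lib/ceph/osd/ceph-<id>)
OSD_DIR=$(mktemp -d)
mkdir -p "$OSD_DIR/current/meta"

# Fake a few stored full maps; real names look like osdmap.1234__0_<hash>__none
for e in 1 2 3; do
  touch "$OSD_DIR/current/meta/osdmap.${e}__0_DEADBEEF__none"
done

# Count the osdmap files the OSD would have to read through at startup
find "$OSD_DIR/current/meta" -name 'osdmap*' | wc -l
```

Running this per OSD on the real cluster would confirm whether all OSDs are carrying the same ~220k-map backlog or only some of them.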

 

This behavior started after we recently filled the cluster to about 72% to see how things behaved.  We also upgraded it to Nautilus 14.2.2 at about the same time.

 

I’ve tried starting just one OSD per server at a time in hopes of avoiding the OOM killer.  I also tried setting noin, rebooting the whole cluster, waiting a day, then marking each of the OSDs in manually.  The end result is the same either way: about 60% of PGs are still down, 30% are peering, and the rest are in worse shape.
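For reference, the staggered-start sequence described above looks roughly like the following. This is a hedged sketch, not a fix: the OSD ids and the 60-second pause are illustrative, the systemd unit name `ceph-osd@<id>` assumes a standard package install, and `CEPH=echo` makes it a dry run (drop it to actually execute):

```shell
# Dry-run prefix; set CEPH="" (empty) to really run the commands
CEPH="echo"

# Prevent newly booted OSDs from being marked in automatically
$CEPH ceph osd set noin

# Start one OSD per server at a time, giving each time to replay osdmaps
for id in 0 1 2; do                     # illustrative ids, one per server
  $CEPH systemctl start "ceph-osd@$id"
  $CEPH sleep 60                        # let it chew through the osdmap backlog
  $CEPH ceph osd in "$id"               # mark it in manually once it is up
done

$CEPH ceph osd unset noin
```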

 

Anyone out there have suggestions about how I should go about getting this cluster healthy again?  Any ideas appreciated.

 

Thanks!

 

- Aaron

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
