Re: Sick Nautilus cluster, OOM killing OSDs, lots of osdmaps

[adding dev]

On Wed, 9 Oct 2019, Aaron Johnson wrote:
> Hi all
> 
> I have a smallish test cluster (14 servers, 84 OSDs) running 14.2.4.  
> Monthly OS patching and reboots that go along with it have resulted in 
> the cluster getting very unwell.
> 
> Many of the servers in the cluster are OOM-killing the ceph-osd 
> processes when they try to start.  (6 OSDs per server running on 
> filestore). Strace shows the ceph-osd processes are spending hours 
> reading through the 220k osdmap files after being started.

Is the process size growing during this time?  There should be a cap to 
the size of the OSDMap cache; perhaps there is a regression there.
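
If you want to sanity-check that cap while an OSD is replaying maps, 
something along these lines should work (just a sketch; osd.NN is a 
placeholder for one of your OSD ids, and it has to be run on the host 
that owns that OSD so the admin socket is reachable):

    # current in-memory osdmap cache cap (default is 50 maps in Nautilus)
    ceph daemon osd.NN config get osd_map_cache_size

    # per-mempool memory breakdown -- the osdmap pools are the interesting ones here
    ceph daemon osd.NN dump_mempools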

One common approach here is to 'ceph osd set noup' and restart the OSD, 
then monitor the OSD's progress catching up on maps with 'ceph daemon 
osd.NN status' (compare the epoch to what you get from 'ceph osd dump | 
head').  This will take a while if you really are 220k maps (!!!) behind, 
but memory usage during that period should stay relatively constant.
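
Roughly something like this, per OSD (a sketch only; it assumes the 
default admin socket location, jq installed, and NN replaced with the 
OSD id, run on the host that carries that OSD):

    ceph osd set noup                      # keep OSDs from being marked up while they catch up
    systemctl restart ceph-osd@NN          # restart one OSD

    while true; do
        cluster_epoch=$(ceph osd dump -f json | jq .epoch)
        osd_epoch=$(ceph daemon osd.NN status | jq .newest_map)
        echo "osd.NN newest_map ${osd_epoch} / cluster epoch ${cluster_epoch}"
        [ "${osd_epoch}" -ge "${cluster_epoch}" ] && break
        sleep 30
    done
    # only unset noup once *all* OSDs have caught up (see below)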


> This behavior started after we recently made it about 72% full to see 
> how things behaved.  We also upgraded it to Nautilus 14.2.2 at about the 
> same time.
> 
> I’ve tried starting just one OSD per server at a time in hopes of 
> avoiding the OOM killer.  Also tried setting noin, rebooting the whole 
> cluster, waiting a day, then marking each of the OSDs in manually.  The 
> end result is the same either way.  About 60% of PGs are still down, 30% 
> are peering, and the rest are in worse shape.

In past instances like this, getting all OSDs caught up on maps and then 
unsetting 'noup' lets them all come up and peer at the same time.  What 
usually goes wrong is that many of the OSDs are not actually caught up and 
it isn't immediately obvious, so PGs don't peer.  Setting noup and waiting 
until every OSD is caught up (as per 'ceph daemon osd.NNN status') before 
unsetting it generally helps; a quick per-host check is sketched below.
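
To check a whole host at once, a quick loop over the admin sockets should 
do it (sketch; assumes the default /var/run/ceph socket paths and jq):

    # on each OSD host: report which map each local OSD has reached
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        ceph daemon "$sock" status | jq '{whoami, newest_map}'
    done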

But none of that explains why you're seeing OOM, so I'm curious what you 
see with memory usage while OSDs are catching up...
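
For the memory side, just watching RSS while one OSD chews through the old 
maps should tell us a lot (nothing Ceph-specific; assumes GNU ps):

    # sample ceph-osd memory every 10 seconds during map catch-up
    while sleep 10; do
        ps -o pid,rss,vsz,args -C ceph-osd --sort=-rss | head -5
    done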

Thanks!
sage

> 
> Anyone out there have suggestions about how I should go about getting 
> this cluster healthy again?  Any ideas appreciated.
> 
> Thanks!
> 
> - Aaron
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
