Regarding RocksDB compaction, if you were in a situation where RocksDB
had spilled over to HDDs (if your cluster is using a hybrid setup), the
compaction should have moved the data back to the fast devices. So it
might have helped in this situation too.
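If you ever need to trigger it again, a manual online compaction can be
requested per OSD. A minimal sketch, assuming your release exposes the
compact admin command (expect extra I/O on the DB device while it runs):

    ceph tell osd.0 compact
    # or for all OSDs at once:
    ceph tell osd.* compact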
Regards,
Frédéric.
On 16/12/2020 at 09:57, Frédéric Nass wrote:
Hi Stefan,
This has me thinking that the issue your cluster is facing probably
relates to bluefs_buffered_io being set to true, as this has been
reported to induce excessive swap usage (with OSDs flapping or OOMing
as a consequence) in some versions, starting from Nautilus I believe.
Can you check the value of bluefs_buffered_io that the OSDs are
currently using?

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep bluefs_buffered_io
Can you check the kernel value of vm.swappiness (default value is 30)?

sysctl vm.swappiness
And can you describe your OSD nodes: number of HDDs and SSDs/NVMes, the
HDD/SSD ratio, and how much memory they have?
You should be able to avoid swap usage by setting bluefs_buffered_io
to false, but your cluster / workload might not allow that,
performance- and stability-wise. Or you may be able to work around the
excessive swap usage (when bluefs_buffered_io is set to true) by
lowering vm.swappiness or disabling swap, as sketched below.
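A minimal sketch of both options, assuming a Nautilus or later cluster
with the centralized config database (adjust values to your environment):

    # option 1: turn buffered BlueFS reads off (runtime + persistent)
    ceph tell osd.* injectargs '--bluefs_buffered_io=false'
    ceph config set osd bluefs_buffered_io false

    # option 2: keep bluefs_buffered_io=true but limit swap usage
    sysctl -w vm.swappiness=1    # make the kernel reluctant to swap
    swapoff -a                   # or disable swap entirely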
Regards,
Frédéric.
On 14/12/2020 at 22:12, Stefan Wild wrote:
Hi Frédéric,
Thanks for the additional input. We are currently only running RGW on
the cluster, so no snapshot removal, but there have been plenty of
remappings with the OSDs failing (all of them at first during and
after the OOM incident, then one-by-one). I haven't had a chance to
look into or test the bluefs_buffered_io setting, but will do that
next. Initial results from compacting all OSDs' RocksDBs look
promising (thank you, Igor!). Things have been stable for the past
two hours, including the two OSDs with issues (one in a reboot loop,
the other with some heartbeats missed), while 15 degraded PGs are
backfilling.
The ballooning of each OSD to over 15 GB of memory right after the
initial crash happened even with osd_memory_target set to 2 GB. The
only thing that helped at that point was to temporarily add enough swap
space to fit 12 x 15 GB and let them do their thing. Once they had all
booted, memory usage went back down to normal levels.
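For anyone facing the same situation, a rough sketch of temporarily
adding a swap file (path and size are placeholders, not the exact
commands we ran; some filesystems require dd instead of fallocate for
swap files):

    fallocate -l 180G /var/tmp/ceph-recovery.swap    # roughly 12 x 15 GB
    chmod 600 /var/tmp/ceph-recovery.swap
    mkswap /var/tmp/ceph-recovery.swap
    swapon /var/tmp/ceph-recovery.swap
    # once the OSDs have settled:
    swapoff /var/tmp/ceph-recovery.swap && rm /var/tmp/ceph-recovery.swap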
I will report back here with more details when the cluster is
hopefully back to a healthy state.
Thanks,
Stefan
On 12/14/20, 3:35 PM, "Frédéric Nass"
<frederic.nass@xxxxxxxxxxxxxxxx> wrote:
Hi Stefan,
The initial data removal could also have resulted from a snapshot
removal leading to OSDs OOMing, then PG remappings leading to more
removals after the OOMed OSDs rejoined the cluster, and so on.
As mentioned by Igor: "Additionally there are users' reports that
recent default value's modification for bluefs_buffered_io setting has
negative impact (or just worsen existing issue with massive removal) as
well. So you might want to switch it back to true."
We're some of them. Our cluster suffered from a severe performance drop
during snapshot removal right after upgrading to Nautilus, due to
bluefs_buffered_io being set to false by default, with slow requests
observed around the cluster.
Once set back to true (can be done with ceph tell osd.* injectargs
'--bluefs_buffered_io=true'), snap trimming was fast again, as it was
before the upgrade, with no more slow requests.
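Note that injectargs only changes the running daemons; to make the
setting persist across OSD restarts, a minimal sketch assuming you use
the centralized config database rather than ceph.conf:

    ceph config set osd bluefs_buffered_io true
    ceph config get osd bluefs_buffered_io    # verify the stored value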
But of course we've seen the excessive memory swap usage described
here: https://github.com/ceph/ceph/pull/34224
So we lowered osd_memory_target from 8 GB to 4 GB and haven't observed
any swap usage since then. You can also have a look here:
https://github.com/ceph/ceph/pull/38044
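For reference, a minimal sketch of applying such a change cluster-wide,
assuming the centralized config database (the value is in bytes;
4294967296 is 4 GB):

    ceph config set osd osd_memory_target 4294967296
    # apply at runtime without restarting the OSDs:
    ceph tell osd.* injectargs '--osd_memory_target=4294967296'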
What you need to look at, to understand whether your cluster would
benefit from changing bluefs_buffered_io back to true, is the %util of
your RocksDB devices in iostat. Run iostat -dmx 1 /dev/sdX (if you're
using SSD RocksDB devices) and look at the %util of the device with
bluefs_buffered_io=false and with bluefs_buffered_io=true. If, with
bluefs_buffered_io=false, the %util is over 75% most of the time, then
you'd better change it to true. :-)
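A rough sketch of that comparison, assuming /dev/sdb is a placeholder
for the RocksDB device you are watching and the workload is comparable
between the two runs:

    ceph tell osd.* injectargs '--bluefs_buffered_io=false'
    iostat -dmx 1 /dev/sdb    # note %util over a few minutes
    ceph tell osd.* injectargs '--bluefs_buffered_io=true'
    iostat -dmx 1 /dev/sdb    # compare %util under the same workload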
Regards,
Frédéric.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx