Re: OSD reboot loop after running out of memory

Hi Frédéric,

Thanks for the additional input. We are currently only running RGW on the cluster, so no snapshot removal, but there have been plenty of remappings as the OSDs failed (all of them at first, during and after the OOM incident, then one by one). I haven't had a chance to look into or test the bluefs_buffered_io setting, but will do that next. Initial results from compacting all OSDs' RocksDBs look promising (thank you, Igor!). Things have been stable for the past two hours, including the two problematic OSDs (one in a reboot loop, the other missing some heartbeats), while 15 degraded PGs are backfilling.
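
In case it helps others, the compaction can be done online with something along these lines (the OSD id is just a placeholder; on releases where the tell command isn't available, the admin socket or the offline ceph-kvstore-tool route can be used instead):

    ceph tell osd.12 compact          # compact a single OSD's RocksDB online
    ceph tell osd.* compact           # or all of them at once

    # offline alternative, with the OSD stopped (path is a placeholder)
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact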

Each OSD ballooned to over 15GB of memory right after the initial crash, even with osd_memory_target set to 2GB. The only thing that helped at that point was to temporarily add enough swap space to fit 12 x 15GB and let them do their thing. Once they had all booted, memory usage went back down to normal levels.
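
For anyone needing to do the same, a temporary swap file per host is enough; a rough sketch (size and path are just examples, sized to peak OSD memory times the number of OSDs on the host):

    fallocate -l 60G /var/swapfile    # or dd if=/dev/zero of=/var/swapfile ... if fallocate isn't supported
    chmod 600 /var/swapfile
    mkswap /var/swapfile
    swapon /var/swapfile
    # swapoff /var/swapfile && rm /var/swapfile once memory usage is back to normal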

I will report back here with more details when the cluster is hopefully back to a healthy state.

Thanks,
Stefan



On 12/14/20, 3:35 PM, "Frédéric Nass" <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

    Hi Stefan,

    The initial data removal could also have come from a snapshot removal 
    that led to OSDs OOMing, then PG remappings triggering more removals 
    once the OOMed OSDs rejoined the cluster, and so on.

    As mentioned by Igor: "Additionally there are users' reports that 
    recent default value's modification for bluefs_buffered_io setting has 
    negative impact (or just worsen existing issue with massive removal) as 
    well. So you might want to switch it back to true."

    We're among them. Our cluster suffered a severe performance drop 
    during snapshot removal right after upgrading to Nautilus, due to 
    bluefs_buffered_io being set to false by default, with slow requests 
    observed across the cluster.
    Once it was set back to true (which can be done with ceph tell osd.* 
    injectargs '--bluefs_buffered_io=true'), snap trimming was fast again, 
    just as before the upgrade, with no more slow requests.
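
    If you want the change to survive OSD restarts, it can also be set in 
    the mon config database (assuming a Mimic or later config store; 
    depending on the release an OSD restart may be needed for it to take 
    effect):

        ceph config set osd bluefs_buffered_io true
        ceph config show osd.0 bluefs_buffered_io   # running value on one OSD (id is an example)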

    But of course we then saw the excessive memory/swap usage described 
    here: https://github.com/ceph/ceph/pull/34224
    So we lowered osd_memory_target from 8GB to 4GB and haven't observed 
    any swap usage since then. You can also have a look here: 
    https://github.com/ceph/ceph/pull/38044
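
    In case it helps, the target can be lowered for all OSDs from the mons 
    (value in bytes here, 4GB as an example):

        ceph config set osd osd_memory_target 4294967296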

    To understand whether your cluster would benefit from switching 
    bluefs_buffered_io back to true, look at the %util of your RocksDB 
    devices in iostat. Run iostat -dmx 1 /dev/sdX (if you're using SSD 
    RocksDB devices) and compare the %util of the device with 
    bluefs_buffered_io=false and with bluefs_buffered_io=true. If the 
    %util is over 75% most of the time with bluefs_buffered_io=false, 
    then you'd better switch it back to true. :-)
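
    If you're not sure which block device holds a given OSD's RocksDB, 
    ceph-volume can tell you, and then it's just a matter of watching 
    %util while toggling the option (device name below is an example):

        ceph-volume lvm list                                      # shows the [db] device per OSD
        iostat -dmx 1 /dev/nvme0n1                                # %util with the current setting
        ceph tell osd.* injectargs '--bluefs_buffered_io=true'
        iostat -dmx 1 /dev/nvme0n1                                # compare %util after the change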

    Regards,

    Frédéric.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



