OSDs get killed by OOM when another host goes down

Marius Leustean <marius.leus@xxxxxxxxx> · Fri, 12 Nov 2021 11:15:36 +0200

We have an 8-host cluster with a 4TB NVMe drive per host for now. The pool
size is 2 and it hosts RBD images for VMs.
Each host has 128GB of RAM installed.

This week one of the hosts went down.
As soon as recovery started, everything went crazy: OSDs on the other
hosts went down, killed by the OOM killer.
When they started again, those OSDs used around 100GB of RAM (up from 19GB
previously) and took around 25 minutes to start. Even after startup they
were so slow that lots of PGs got stuck in peering.
We had to wipe the OSDs one by one and register them again to get back to
normal start times and memory consumption.
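
To put those numbers in perspective, here is a rough back-of-the-envelope
sketch in Python. It only uses the figures quoted above; the function names
are made up for illustration, and it assumes a single replicated pool
spanning all drives, ignoring BlueStore overhead and full ratios.

```python
# Figures quoted in this message.
HOSTS = 8
DRIVE_TB = 4            # one 4TB NVMe drive per host
POOL_SIZE = 2           # replica count of the pool
HOST_RAM_GB = 128
OSD_RAM_BEFORE_GB = 19  # OSD memory before the incident
OSD_RAM_AFTER_GB = 100  # OSD memory after the OOM kill and restart


def usable_tb(hosts: int) -> float:
    """Raw capacity divided by the replica count (hypothetical helper)."""
    return hosts * DRIVE_TB / POOL_SIZE


def ram_share(osd_gb: float) -> float:
    """Fraction of one host's RAM taken by an OSD process (hypothetical helper)."""
    return osd_gb / HOST_RAM_GB


print(f"usable capacity, all hosts up: {usable_tb(HOSTS):.0f} TB")
print(f"usable capacity, one host down: {usable_tb(HOSTS - 1):.0f} TB")
print(f"OSD share of host RAM before: {ram_share(OSD_RAM_BEFORE_GB):.0%}")
print(f"OSD share of host RAM after: {ram_share(OSD_RAM_AFTER_GB):.0%}")
print(f"memory growth factor: {OSD_RAM_AFTER_GB / OSD_RAM_BEFORE_GB:.1f}x")
```

So an OSD went from a modest share of the host's memory to most of it,
which leaves very little headroom for anything else on the host.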

What actually happens during recovery / backfill that makes a single OSD
need this much memory?
Why does the OSD process still take ~100GB of RAM and 25 minutes to start
even after recovery has finished (unless we wipe it and register it again)?

Any clarification would be appreciated.
Thanks!