OSDs get killed by OOM when another host goes down

We have an 8-host cluster with one 4TB NVMe drive per host for now. The pool
size is 2 and it hosts RBD images for VMs.
Each host has 128GB of RAM installed.
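
In case it helps, the relevant settings can be checked with something like the
following (<pool-name> stands in for our RBD pool; we have not pasted the
actual output here):

    # configured per-OSD memory target (default is 4GiB in recent releases)
    ceph config get osd osd_memory_target
    # replica count of the pool mentioned above
    ceph osd pool get <pool-name> size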

This week one of the hosts went down.
Right when recovery started, everything went crazy: OSDs on the other hosts
went down, killed by the OOM killer.
When they were started again, those OSDs used around 100GB of RAM each (up
from about 19GB previously) and took around 25 minutes to start. They were so
slow even after startup that lots of PGs got stuck in peering.
We had to wipe the OSDs one by one and re-register them to get back to normal
start times and memory consumption.
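
If it helps with diagnosis, this is the kind of inspection we can run on one
of the affected OSDs and share (osd.12 is just an example id; heap stats
assumes the default tcmalloc allocator):

    # per-subsystem memory accounting inside the OSD (osd_pglog, bluestore caches, etc.)
    ceph daemon osd.12 dump_mempools
    # allocator-level view of resident vs. freed-but-unreleased memory
    ceph tell osd.12 heap stats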

What is really happening during recovery/backfill that needs this much
memory for a single OSD?
Why does the OSD process keep using ~100GB of RAM and take ~25 minutes to
start even after recovery has finished (unless we wipe it and register it
again)?

Any clarification would be appreciated.
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


