We have an 8-host cluster with one 4 TB NVMe drive per host for now. The pool size is 2 and it hosts RBD images for VMs. Each host has 128 GB of RAM installed.

This week one of the hosts went down. As soon as recovery started, everything went crazy. OSDs on the other hosts went down, killed by the OOM killer. When they started again, those OSDs used around 100 GB of RAM (up from 19 GB previously) and took around 25 minutes to start. Even after startup they were so slow that many PGs got stuck in peering. We had to wipe the OSDs one by one and register them again to get back to normal start times and memory consumption.

What is really happening during recovery / backfill that takes this much memory for a single OSD? Why does the OSD process take ~100 GB of RAM and 25 minutes to start even after the recovery process has ended (unless we wipe it and register it again)?

Any clarification would be appreciated.

Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx