What can you do with osdmap in this case?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

On 2021. Nov 12., at 13:22, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:

Hi Marius,

> We have an 8-host cluster with one 4TB NVMe drive per host for now.
> The pool size is 2 and it's hosting RBD images for VMs. Each host has
> 128GB RAM installed.

How many OSDs per host? How many PGs per OSD? Which Ceph version?

> What is really happening during recovery/backfill that takes this much
> memory for a single OSD?

It would be helpful to see what "ceph daemon osd.XXX dump_mempools" says
for an OSD with high memory usage. One problem that has been seen is
that pglogs start consuming quite a bit of memory during recovery
scenarios (or even occasionally during steady state). This has been
alleviated a bit in Octopus+, where there's a limit on the number of
pglog entries per OSD, but there are still gaps.

> Why is the OSD process taking ~100GB RAM and needing 25 minutes to
> start even after recovery has ended (unless we wipe the OSD and
> register it again)?

This sounds like a pileup of osdmaps. Depending on your Ceph version,
all OSDs may need to be up+in in order for osdmaps to be trimmed
effectively.

Josh
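
As a quick sketch of how one might check both suspects Josh mentions:
the OSD id "12" below is just a placeholder; point the commands at any
OSD whose resident memory looks too high, and run them on that OSD's
host so its admin socket is reachable.

    # Per-OSD memory breakdown; the "osd_pglog" and "osdmap" buckets are
    # the ones relevant to the problems described above.
    ceph daemon osd.12 dump_mempools

    # Range of osdmap epochs the OSD is still holding; a very large gap
    # between oldest_map and newest_map points at untrimmed osdmaps.
    ceph daemon osd.12 status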