Hi Josh,

There is 1 OSD per host. There are 3 pools of 256, 128, and 32 PGs (total = 416 PGs across 8 OSDs).

ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)

I still have 1 OSD where docker reports 61GB of RAM consumed by the container (we have a containerized deployment). The dump_mempools output is this one: https://paste2.org/yfng2saG (it reports 38GB).

All OSDs are currently up+in and the cluster is HEALTH_OK.

Thanks!

On Fri, Nov 12, 2021 at 2:22 PM Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
> Hi Marius,
>
> > We have an 8-host cluster with one 4TB NVMe drive per host for now.
> > The pool size is 2 and it's hosting RBD images for VMs.
> > Each host has 128GB RAM installed.
>
> How many OSDs/host? How many PGs/OSD? Which Ceph version?
>
> > What is really happening during recovery / backfills that takes this
> > much memory for 1 single OSD?
>
> It would be helpful to see what the "ceph daemon osd.XXX dump_mempools"
> command says for an OSD with high memory. One problem that has been
> seen is that pglogs start consuming quite a bit of memory during
> recovery scenarios (or even occasionally during steady state). This
> issue has been alleviated a bit in Octopus+, where there's a limit on
> the number of pglog entries per OSD, but there are still gaps.
>
> > Why is the OSD process taking ~100GB RAM and having a 25min start
> > time even if the recovery process ended? (unless we wipe it and
> > register it again)
>
> This sounds like a pileup of osdmaps. Depending on your Ceph version,
> all OSDs may need to be up+in in order to trim osdmaps effectively.
>
> Josh
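
For anyone reading the paste: a minimal sketch for pulling the pglog
figure out of dump_mempools, assuming the Octopus JSON layout (a
top-level "mempool" object with "by_pool" entries) and that jq is
available on the host; osd.0 is a placeholder id:

    # Bytes currently held by pglog entries on this OSD
    ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.osd_pglog.bytes'

    # Compare against the total across all mempools
    ceph daemon osd.0 dump_mempools | jq '.mempool.total.bytes'

If osd_pglog accounts for most of the 38GB, that matches the pglog
behaviour Josh describes rather than, say, the bluestore caches.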
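
The per-OSD pglog limit Josh mentions is, as far as I know, the
osd_target_pg_log_entries_per_osd option (option name assumed here;
worth confirming with "ceph config help" before relying on it):

    # Inspect the assumed per-OSD pglog cap and its documentation
    ceph config get osd osd_target_pg_log_entries_per_osd
    ceph config help osd_target_pg_log_entries_per_osd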
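
And to check for the osdmap pileup: a sketch assuming "ceph report"
exposes the osdmap_first_committed / osdmap_last_committed fields (true
on recent releases; worth verifying on 15.2.14):

    # A large gap between these values means many old osdmaps are still
    # retained (and get replayed at OSD start, hence the slow startup)
    ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

Once all OSDs are up+in, the gap should shrink as the monitors trim old
maps.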