Re: OSDs get killed by OOM when other host goes down

Hi Marius,

> We have an 8-host cluster with a 4TB NVMe drive per host for now. The pool
> size is 2 and it is hosting RBD images for VMs.
> Each host has 128GB of RAM installed.

How many OSDs/host? How many PGs/OSD? Which Ceph version?
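
If it helps, something along these lines should surface most of that
(osd/pool names in the output are your own, nothing here is specific
to your cluster):

    ceph versions              # release running on mons/mgrs/osds
    ceph osd tree              # OSD count and placement per host
    ceph osd df tree           # PGs and utilization per OSD
    ceph osd pool ls detail    # pg_num and size per pool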

> What is really happening during recovery / backfill that takes this much
> memory for a single OSD?

It would be helpful to see what "ceph daemon osd.XXX dump_mempools"
reports for an OSD with high memory usage. One problem that has been
seen is pglogs consuming quite a bit of memory during recovery
scenarios (or even, occasionally, in steady state). This has been
alleviated somewhat in Octopus and later, which limit the total
number of pglog entries per OSD, but there are still gaps.
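
As a rough sketch (osd.XXX is a placeholder, jq is assumed to be
available, and the exact JSON layout can differ slightly between
releases), the pglog share shows up in the mempool dump roughly like:

    # pglog mempool in bytes, then the top five mempools overall
    ceph daemon osd.XXX dump_mempools | jq '.mempool.by_pool.osd_pglog'
    ceph daemon osd.XXX dump_mempools | jq '.mempool.by_pool | to_entries
        | sort_by(.value.bytes) | reverse | .[0:5]'

If osd_pglog (or buffer_anon) dominates there, the pglog explanation
fits; the Octopus-era per-OSD cap is the osd_target_pg_log_entries_per_osd
option, if I recall the name correctly.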

> Why is the OSD process using ~100GB of RAM and taking ~25min to start even
> after the recovery process has ended? (unless we wipe it and register it again).

This sounds like a pileup of osdmaps. Depending on your Ceph version,
all OSDs may need to be up+in in order to trim osdmaps effectively.
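
One quick way to check (osd.XXX is again a placeholder, and the exact
field names vary a bit by release) is to compare the oldest and newest
osdmap epochs the daemons are still carrying:

    ceph daemon osd.XXX status    # compare oldest_map vs newest_map
    ceph report | jq '.osdmap_first_committed, .osdmap_last_committed'

A gap of tens of thousands of epochs means the maps aren't being
trimmed; every OSD stores them and has to work through a lot of that
history at startup, which would line up with both the memory footprint
and the long start time you're seeing.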

Josh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


