Hi,

I have a currently-down ceph cluster:
* v17.2.0 / quay.io/v17.2.0-20220420
* 3 nodes, 4 OSDs
* around 1TiB used of 3TiB total
* probably enough resources
  - two of those nodes have 64GiB memory, the third has 16GiB
  - one of the 64GiB nodes runs two OSDs, as it's a physical node with 2 NVMe drives
* provisioned via Rook and running in my Kubernetes cluster

After some upgrades yesterday (system packages on the nodes) and today (Kubernetes to the latest version), I wanted to reboot my nodes. Draining the first node put a lot of stress on the other OSDs, making them go OOM - I think that is probably a bug in itself, as at least one of those nodes has enough resources (64GiB memory, physical machine, surely ~40GiB free - but I don't have metrics right now as everything is down).

I'm now seeing all OSDs go OOM right on startup. From what I can tell, everything is fine until right after `load_pgs` - as soon as an OSD activates some PGs, memory usage increases _a lot_ (from ~4-5GiB RES before to ~60GiB, though that depends on the free memory on the node). Because of this, I cannot get any of them online again and need advice on what to do and what information might be useful.

Logs of one of those OSDs are here[1] (captured via kubectl logs, so something from right at the start might be missing - happy to dig deeper if you need more), and my changed ceph.conf entries are here[2]. I had `bluefs_buffered_io = false` until today and changed it to true based on a suggestion in another debug thread[3]; see the P.S. below for the exact snippet.

Any hint is greatly appreciated, many thanks

Mara Grosch

[1] https://pastebin.com/VFczNqUk
[2] https://pastebin.com/QXust5XD
[3] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/CBPXLPWEVZLZE55WAQSMB7KSIQPV5I76/
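
P.S. For reference, this is the change I made - a minimal sketch from memory; the exact section placement in my real file may differ, the full set of changed entries is in [2]:

    [osd]
    # was false until today; flipped to true based on the suggestion in [3]
    bluefs_buffered_io = true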