Re: OSDs getting OOM-killed right after startup

It's even worse: 1048576 bytes is only 1 MiB, so you give them 1 MB, not 1 GB.

Quoting Eugen Block <eblock@xxxxxx>:

Hi,

is there any reason you're using custom configs? Most of the defaults work well. But you only give your OSDs 1 GB of memory, which is far too low except for an idle cluster without much data. I recommend removing the line

    osd_memory_target = 1048576

and letting Ceph handle it. I haven't installed Quincy yet, but in a healthy cluster the OSDs will take around 3 GB of memory, maybe 4, so you should be good with your setup.
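
If the value is stored in the mon config database rather than only in ceph.conf, something like this should show it and drop the override so the default (4 GiB in recent releases) applies again - just a sketch, adjust it to wherever your setup actually stores the option:

    # what the config database currently holds for OSDs
    ceph config get osd osd_memory_target

    # drop a centrally stored override (no effect if the value only lives in ceph.conf)
    ceph config rm osd osd_memory_target

If it only lives in ceph.conf, deleting the line and restarting the OSDs is enough.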

Regards,
Eugen

Quoting Mara Sophie Grosch <littlefox@xxxxxxxxxx>:

Hi,

I have a currently-down Ceph cluster:
* v17.2.0 / quay.io/v17.2.0-20220420
* 3 nodes, 4 OSDs
* around 1TiB used/3TiB total
* probably enough resources
 - two of those nodes have 64GiB memory, the third has 16GiB
 - one of the 64GiB nodes runs two OSDs, as it's a physical node with
   2 NVMe drives
* provisioned via Rook and running in my Kubernetes cluster

After some upgrades yesterday (system packages on the nodes) and today
(Kubernetes to the latest version), I wanted to reboot my nodes.
Draining the first node put a lot of stress on the other OSDs, making
them go OOM - but I think that is probably a bug in itself, as at least
one of those nodes has enough resources (64 GiB memory, physical
machine, ~40 GiB surely free - but I don't have metrics right now as
everything is down).

I'm now seeing all OSDs go OOM right on startup. From what it looks
like, everything is fine until right after `load_pgs` - as soon as an
OSD activates some PGs, memory usage increases _a lot_ (from ~4-5 GiB
RES before to around 60 GiB, though that depends on the free memory on
the node).
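
For what it's worth, a rough way to watch that per-OSD memory in a Rook
cluster (assuming the default rook-ceph namespace, the app=rook-ceph-osd
pod label and a working metrics-server) would be something like:

    # per-pod memory as seen by Kubernetes; needs metrics-server,
    # assumes the default rook-ceph namespace and OSD pod label
    kubectl -n rook-ceph top pods -l app=rook-ceph-osd

    # or RSS (in KiB) of the ceph-osd processes directly on a node
    watch -n 5 'ps -C ceph-osd -o pid,rss,cmd'

kubectl top only reports container-level usage, so the ps variant on the
node is closer to the RES values above.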

Because of this, I cannot get any of them online again and need advice
on what to do and what info might be useful. Logs of one of those OSDs
are here[1] (captured via kubectl logs, so something right from the
start might be missing - happy to dig deeper if you need more) and my
changed ceph.conf entries are here[2]. I had
`bluefs_buffered_io = false` until today and changed it to true based
on a suggestion in another debug thread[3].
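
In case it matters for reproducing: with Rook those entries are managed
via the rook-config-override ConfigMap (assuming the default rook-ceph
namespace and setup), which ends up as the daemons' ceph.conf - roughly:

    # edit the override; daemons pick up changes after a restart
    # (assumes Rook's default ConfigMap name and namespace)
    kubectl -n rook-ceph edit configmap rook-config-override

    # its "config" key holds the ceph.conf-style entries, e.g.
    # [global]
    # bluefs_buffered_io = true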

Any hint is greatly appreciated, many thanks
Mara Grosch

[1] https://pastebin.com/VFczNqUk
[2] https://pastebin.com/QXust5XD
[3] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/CBPXLPWEVZLZE55WAQSMB7KSIQPV5I76/



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


