Re: OSDs getting OOM-killed right after startup

It's even worse: 1048576 bytes is only 1 MiB, so you give them 1 MB, not 1 GB.

Quoting Eugen Block <eblock@xxxxxx>:

Hi,

is there any reason you're using custom configs? Most of the defaults work well. But you only give your OSDs 1 GB of memory, which is far too low except for an idle cluster without much data. I recommend removing the line

    osd_memory_target = 1048576

and letting Ceph handle it. I haven't installed Quincy yet, but in a healthy cluster the OSDs will take around 3 GB of memory, maybe 4, so you should be good with your setup.
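
If the value is stored in the mon config database rather than only in ceph.conf, something like this should show it and drop the override so the default (4 GiB in recent releases) applies again - just a sketch, adjust it to wherever your setup actually stores the option:

    # what the config database currently holds for OSDs
    ceph config get osd osd_memory_target

    # drop a centrally stored override (no effect if the value only lives in ceph.conf)
    ceph config rm osd osd_memory_target

If it only lives in ceph.conf, deleting the line and restarting the OSDs is enough.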

Regards,
Eugen

Quoting Mara Sophie Grosch <littlefox@xxxxxxxxxx>:

Hi,

I have a currently-down Ceph cluster:
* v17.2.0 / quay.io/v17.2.0-20220420
* 3 nodes, 4 OSDs
* around 1TiB used/3TiB total
* probably enough resources
 - two of those nodes have 64GiB memory, the third has 16GiB
 - one of the 64GiB nodes runs two OSDs, as it's a physical node with
   2 NVMe drives
* provisioned via Rook and running in my Kubernetes cluster

After some upgrades yesterday (system packages on the nodes) and today
(Kubernetes to the latest version), I wanted to reboot my nodes.
Draining the first node put a lot of stress on the other OSDs, making
them go OOM - but I think that is probably a bug in itself, as at least
one of those nodes has enough resources (64 GiB memory, physical
machine, ~40 GiB surely free - but I don't have metrics right now as
everything is down).

I'm now seeing all OSDs go OOM right on startup. From what it looks
like, everything is fine until right after `load_pgs` - as soon as an
OSD activates some PGs, memory usage increases _a lot_ (from ~4-5 GiB
RES before to around 60 GiB, though that depends on the free memory on
the node).
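
For what it's worth, a rough way to watch that per-OSD memory in a Rook
cluster (assuming the default rook-ceph namespace, the app=rook-ceph-osd
pod label and a working metrics-server) would be something like:

    # per-pod memory as seen by Kubernetes; needs metrics-server,
    # assumes the default rook-ceph namespace and OSD pod label
    kubectl -n rook-ceph top pods -l app=rook-ceph-osd

    # or RSS (in KiB) of the ceph-osd processes directly on a node
    watch -n 5 'ps -C ceph-osd -o pid,rss,cmd'

kubectl top only reports container-level usage, so the ps variant on the
node is closer to the RES values above.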

Because of this, I cannot get any of them online again and need advice
on what to do and what info might be useful. Logs of one of those OSDs
are here[1] (captured via kubectl logs, so something right from the
start might be missing - happy to dig deeper if you need more) and my
changed ceph.conf entries are here[2]. I had
`bluefs_buffered_io = false` until today and changed it to true based
on a suggestion in another debug thread[3].
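
In case it matters for reproducing: with Rook those entries are managed
via the rook-config-override ConfigMap (assuming the default rook-ceph
namespace and setup), which ends up as the daemons' ceph.conf - roughly:

    # edit the override; daemons pick up changes after a restart
    # (assumes Rook's default ConfigMap name and namespace)
    kubectl -n rook-ceph edit configmap rook-config-override

    # its "config" key holds the ceph.conf-style entries, e.g.
    # [global]
    # bluefs_buffered_io = true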

Any hint is greatly appreciated, many thanks
Mara Grosch

[1] https://pastebin.com/VFczNqUk
[2] https://pastebin.com/QXust5XD
[3] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/CBPXLPWEVZLZE55WAQSMB7KSIQPV5I76/



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


