High memory usage kills OSD while peering

Hello everybody,
I have a Kraken cluster with 660 OSDs. It is currently down because it
cannot complete peering: the OSDs start consuming so much memory that
they drain the system and kill the node. I therefore set a memory limit
on the OSD service (28G on some OSDs, as high as 35G on others) so that
they get killed before taking down the whole node.
Peering still does not complete. With about 300 OSDs already up, a single
OSD entering the cluster drives the memory usage of most other OSDs very
high (15G+, some as much as 30G) and sometimes gets them killed when they
reach the service limit. This causes a load spiral in which the OSDs
consume all the available memory.

I found this thread with similar symptoms:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html

That thread includes a request for a stack trace. I have a 14G core dump, which we generated by running the OSD from the terminal with core dumps enabled and the ulimit set to 15G. What kind of trace would be useful? All threads? Is there a better way to debug this?
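
In case an all-thread backtrace is what is wanted, here is a minimal sketch of how it could be pulled out of the core file with gdb in batch mode. The core and output paths are hypothetical placeholders, and I am assuming the matching debug symbols (on CentOS presumably the ceph-debuginfo package) are installed, otherwise the frames will not be very readable.

```python
import shlex
import subprocess


def all_threads_backtrace_cmd(binary, core):
    """Build a gdb command line that prints a backtrace of every thread in a core file."""
    # -batch: run the given commands and exit; -ex: execute one gdb command
    return ["gdb", "-batch", "-ex", "thread apply all bt", binary, core]


def dump_backtrace(binary, core, out_path):
    """Run gdb non-interactively and save its output to a text file."""
    with open(out_path, "w") as out:
        subprocess.run(
            all_threads_backtrace_cmd(binary, core),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=True,
        )


if __name__ == "__main__":
    # Hypothetical paths: adjust to where the OSD binary and the core dump actually are.
    cmd = all_threads_backtrace_cmd("/usr/bin/ceph-osd", "/var/tmp/core.ceph-osd")
    print(shlex.join(cmd))
    # dump_backtrace("/usr/bin/ceph-osd", "/var/tmp/core.ceph-osd", "osd-backtrace.txt")
```

The resulting text file should be far smaller than the 14G core and easier to attach to a mail or a tracker ticket.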

What can I do to make it work? Is this memory allocation normal?

Some info about the cluster:
41 HDD nodes with 12 x 4TB OSDs each (5 of the nodes have 8TB disks), 324GB RAM and dual-socket Intel Xeon.
7 SSD nodes with 24 x 400GB OSDs each, 256GB RAM and dual-socket CPU.
3 monitors

All nodes have dual 10Gb Ethernet, except for the monitors, which have dual 1Gb Ethernet.

All nodes are running CentOS 7.2.
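
As a rough sanity check of the numbers above, the following Python sketch adds up the OSD count and compares the per-OSD service limits with the RAM on each node type. It assumes every disk is one OSD and that the same limit could apply on either node type; which limit is set on which nodes is not spelled out above, so treat the output as an upper bound only.

```python
# (node count, OSDs per node, RAM per node in GB), taken from the description above
NODE_TYPES = {
    "hdd": (41, 12, 324),
    "ssd": (7, 24, 256),
}

# Per-OSD service limits in GB mentioned earlier in this message
SERVICE_LIMITS_GB = (28, 35)


def total_osds(node_types):
    """Total number of OSDs across all node types."""
    return sum(count * osds for count, osds, _ram in node_types.values())


def worst_case_gb(osds_per_node, limit_gb):
    """Memory one node would need if every OSD on it grew to the service limit."""
    return osds_per_node * limit_gb


def ram_per_osd_gb(osds_per_node, ram_gb):
    """RAM available per OSD if the node's memory were split evenly."""
    return ram_gb / osds_per_node


if __name__ == "__main__":
    print("total OSDs:", total_osds(NODE_TYPES))
    for name, (_count, osds, ram) in NODE_TYPES.items():
        print(f"{name}: {ram_per_osd_gb(osds, ram):.1f} GB RAM per OSD")
        for limit in SERVICE_LIMITS_GB:
            need = worst_case_gb(osds, limit)
            status = "exceeds" if need > ram else "fits in"
            print(f"  limit {limit}G -> up to {need} GB, {status} {ram} GB RAM")
```

The node counts do add up to 660 OSDs. Even the lower 28G limit allows 12 x 28 = 336G on an HDD node, slightly more than its 324G of RAM, so the service limit by itself cannot protect a node if all of its OSDs grow at the same time.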
It is an old cluster that has been upgraded continuously over the past 3 years. The cluster was on Jewel when the issue happened: some accidental OSD map changes caused heavy recovery operations on the cluster. We then upgraded to Kraken in the hope of a smaller memory footprint.

Any advice on how to proceed?

Thanks in advance
