Re: Ceph OSDs cause kernel unresponsive

Craig Chi <craigchi@xxxxxxxxxxxx> · Thu, 24 Nov 2016 18:37:09 +0800

Hi Nick,

Thank you for your helpful information.

I know that Ceph recommends 1 GB of RAM per 1 TB of storage, but we are not going to change the hardware architecture now.
Is there any way to limit the resources a single OSD can consume?

As for your question, our current sysctl configuration is:

vm.swappiness=10

kernel.pid_max=4194303

fs.file-max=26234859

vm.zone_reclaim_mode=0

vm.vfs_cache_pressure=50

vm.min_free_kbytes=4194303

I will try a larger vm.min_free_kbytes and test.
I would be grateful to hear from anyone with experience tuning these values for Ceph.
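For reference, a small sketch of the unit conversion involved. It assumes that vm.min_free_kbytes is counted in units of 1024 bytes and that the "8-16GB" suggested in the quoted reply means GiB; the helper names are made up for illustration.

```python
# Convert between GiB and the kilobyte unit used by vm.min_free_kbytes.
# Assumption: 1 kbyte = 1024 bytes and "GB" in the thread means GiB.
KBYTES_PER_GIB = 1024 * 1024


def gib_to_kbytes(gib):
    """Return the vm.min_free_kbytes value for a reserve of `gib` GiB."""
    return gib * KBYTES_PER_GIB


def kbytes_to_gib(kbytes):
    """Return how many GiB a vm.min_free_kbytes value reserves."""
    return kbytes / KBYTES_PER_GIB


if __name__ == "__main__":
    current = 4194303  # value currently set on the storage nodes
    print(f"current reserve: {kbytes_to_gib(current):.2f} GiB")
    for gib in (8, 16):
        print(f"vm.min_free_kbytes={gib_to_kbytes(gib)}  # {gib} GiB")
```

Under these assumptions the current setting reserves just under 4 GiB, and the suggested range corresponds to 8388608 through 16777216.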

Sincerely,

Craig Chi

On 2016-11-24 17:48, Nick Fisk <nick@xxxxxxxxxx> wrote:

Hi Craig,

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Craig Chi

Sent: 24 November 2016 08:34

To: ceph-users@xxxxxxxxxxxxxx

Subject:  Ceph OSDs cause kernel unresponsive

Hi Cephers,

We have encountered a kernel hang issue on our Ceph cluster, as shown in http://imgur.com/a/U2Flz , http://imgur.com/a/lyEko or http://imgur.com/a/IGXdu .

We believe it is caused by running out of memory, because we observed that when the OSDs went crazy, the available memory on each node decreased rapidly (from 50% available to below 10%). Then the node running the Ceph OSDs became unresponsive, with the console showing hung_task_timeout, slab_out_of_memory, etc. The only thing we can do at that point is hard reset the machine.

It is hard to predict when the kernel hang will happen. In my experience, it usually happens after a long benchmark run followed by a manual trigger such as 1) rebooting a node, 2) restarting all OSDs, or 3) modifying the CRUSH map.

The cluster is currently back to normal, but we want to find the root cause so that it does not happen again. We think the high values in our ceph.conf are pretty suspicious, but without tracing the code it is hard for us to understand how these values affect memory consumption.

Many thanks if you have any suggestions.

I think you are probably running out of memory. 90 x 8TB disks is 720TB of storage per node, which needs a lot of RAM to run. The fact that the problems occur when PGs start moving around after a node failure also suggests this.
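The arithmetic behind this estimate can be sketched as follows, using the 1 GB of RAM per 1 TB of storage rule of thumb mentioned earlier in the thread and the per-node hardware figures from the original message. The function name is hypothetical and the rule is only a guideline, not a measured requirement.

```python
# Per-node figures taken from the cluster description in this thread.
DISKS_PER_NODE = 90
DISK_SIZE_TB = 8
RAM_PER_NODE_GB = 256

# Rule of thumb quoted in the thread: 1 GB of RAM per 1 TB of storage.
GB_RAM_PER_TB = 1.0


def ram_shortfall_gb(disks, disk_tb, ram_gb, gb_per_tb=GB_RAM_PER_TB):
    """Return (recommended RAM in GB, shortfall in GB) for one node."""
    recommended = disks * disk_tb * gb_per_tb
    return recommended, recommended - ram_gb


if __name__ == "__main__":
    recommended, shortfall = ram_shortfall_gb(
        DISKS_PER_NODE, DISK_SIZE_TB, RAM_PER_NODE_GB
    )
    print(f"recommended: {recommended:.0f} GB, shortfall: {shortfall:.0f} GB")
```

By that guideline each node would want about 720 GB of RAM, so the 256 GB installed is roughly a third of the recommendation.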

Have you adjusted your vm.vfs_cache_pressure?

You might also want to set vm.min_free_kbytes to 8-16GB to keep some memory free and avoid fragmentation.

=================================================================================

The following is our Ceph cluster architecture:

OS: Ubuntu 16.04.1 LTS (4.4.0-31-generic #50-Ubuntu x86_64 GNU/Linux)

Ceph: Jewel 10.2.3

3 Ceph Monitors running on 3 dedicated machines

630 Ceph OSDs running on 7 storage machines (each machine has 256GB RAM and 90 units of 8TB hard drives)

There are 4 pools with the following settings:

vms     512  pg x 3 replica

images  512  pg x 3 replica

volumes 8192 pg x 3 replica

objects 4096 pg x (17,3) erasure code profile

==> average 173.92 pgs per OSD
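The average above can be reproduced with a short calculation. This sketch assumes that "(17,3)" means k=17 data chunks plus m=3 coding chunks, so each PG in the erasure-coded pool is placed on 20 OSDs; the variable and function names are illustrative only.

```python
# Pool layout from this message: (name, pg_num, OSDs used per PG).
# Replicated pools use 3 OSDs per PG; the erasure-coded pool is assumed
# to use k + m = 17 + 3 = 20 OSDs per PG.
POOLS = [
    ("vms", 512, 3),
    ("images", 512, 3),
    ("volumes", 8192, 3),
    ("objects", 4096, 17 + 3),
]
NUM_OSDS = 630


def pgs_per_osd(pools, num_osds):
    """Average number of PG copies (or shards) hosted by each OSD."""
    total_pg_copies = sum(pg_num * copies for _, pg_num, copies in pools)
    return total_pg_copies / num_osds


if __name__ == "__main__":
    print(f"{pgs_per_osd(POOLS, NUM_OSDS):.2f} PGs per OSD")
```

The erasure-coded pool contributes 81920 of the 109568 PG shards, which is most of the per-OSD load.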

We tuned our ceph.conf by referencing many performance tuning resources online (mainly slide 38 of https://goo.gl/Idkh41):

[global]

osd pool default pg num = 4096

osd pool default pgp num = 4096

err to syslog = true

log to syslog = true

osd pool default size = 3

max open files = 131072

fsid = 1c33bf75-e080-4a70-9fd8-860ff216f595

osd crush chooseleaf type = 1

[mon.mon1]

host = mon1

mon addr = 172.20.1.2

[mon.mon2]

host = mon2

mon addr = 172.20.1.3

[mon.mon3]

host = mon3

mon addr = 172.20.1.4

[mon]

mon osd full ratio = 0.85

mon osd nearfull ratio = 0.7

mon osd down out interval = 600

mon osd down out subtree limit = host

mon allow pool delete = true

mon compact on start = true

[osd]

public_network = 172.20.3.1/21

cluster_network = 172.24.0.1/24

osd disk threads = 4

osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier,inode64,logbsize=256k

osd crush update on start = false

osd op threads = 20

osd mkfs options xfs = -f -i size=2048

osd max write size = 512

osd mkfs type = xfs

osd journal size = 5120

filestore max inline xattrs = 6

filestore queue committing max bytes = 1048576000

filestore queue committing max ops = 5000

filestore queue max bytes = 1048576000

filestore op threads = 32

filestore max inline xattr size = 254

filestore max sync interval = 15

filestore min sync interval = 10

journal max write bytes = 1048576000

journal max write entries = 1000

journal queue max ops = 3000

journal queue max bytes = 1048576000

ms dispatch throttle bytes = 1048576000
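As a rough illustration of why the byte limits above look suspicious, the sketch below sums them per OSD and multiplies by the 90 OSDs on each node. It assumes, without confirmation from the Ceph code, that each limit bounds a separate in-memory buffer and that all of them could fill at once, so treat the result as a pessimistic ceiling rather than a measurement. RAM is taken as 256 GiB.

```python
# Byte-valued limits copied from the [osd] section above.
BYTE_LIMITS = {
    "filestore queue committing max bytes": 1048576000,
    "filestore queue max bytes": 1048576000,
    "journal max write bytes": 1048576000,
    "journal queue max bytes": 1048576000,
    "ms dispatch throttle bytes": 1048576000,
}
OSDS_PER_NODE = 90
RAM_PER_NODE_BYTES = 256 * 1024**3  # assuming 256 GiB per node


def worst_case_bytes(limits, osds):
    """Sum of all byte limits across every OSD on a node.

    Assumption: the limits are independent and additive, which the
    thread does not establish.
    """
    return sum(limits.values()) * osds


if __name__ == "__main__":
    ceiling = worst_case_bytes(BYTE_LIMITS, OSDS_PER_NODE)
    print(f"ceiling: {ceiling / 1024**3:.1f} GiB")
    print(f"ratio to RAM: {ceiling / RAM_PER_NODE_BYTES:.2f}")
```

If the assumption holds even partially, the configured limits allow far more buffered data per node than the installed RAM can hold, which would fit the behavior seen during recovery.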

Sincerely,

Craig Chi

Sent from Synology MailPlus

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com