kraken-bluestore 11.2.0 memory leak issue

Muthusamy Muthiah <muthiah.muthusamy@xxxxxxxxx> · Tue, 14 Feb 2017 14:22:59 +0530

Hi All,
On all of our 5-node clusters running Ceph 11.2.0 we are encountering memory leak issues.

Cluster details: 5 nodes with 24/68 disks per node, EC: 4+1, RHEL 7.2

Some sar traces are below, and the memory utilisation graph is attached.

(16:54:42)[cn2.c1 sa] # sar -r
07:50:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
10:20:01  32077264 132754368    80.54     16176  3040244 77767024   47.18 51991692 2676468     260
10:30:01  32208384 132623248    80.46     16176  3048536 77832312   47.22 51851512 2684552      12
10:40:01  32067244 132764388    80.55     16176  3059076 77832316   47.22 51983332 2694708     264
10:50:01  30626144 134205488    81.42     16176  3064340 78177232   47.43 53414144 2693712       4
11:00:01  28927656 135903976    82.45     16176  3074064 78958568   47.90 55114284 2702892      12
11:10:01  27158548 137673084    83.52     16176  3080600 80553936   48.87 56873664 2708904      12
11:20:01  26455556 138376076    83.95     16176  3080436 81991036   49.74 57570280 2708500       8
11:30:01  26002252 138829380    84.22     16176  3090556 82223840   49.88 58015048 2718036      16
11:40:01  25965924 138865708    84.25     16176  3089708 83734584   50.80 58049980 2716740      12
11:50:01  26142888 138688744    84.14     16176  3089544 83800100   50.84 57869628 2715400      16

...
...
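A rough way to quantify the growth in the trace above is to parse the rows and compute the change in kbmemused over time. The sketch below is illustrative only: the function names (parse_sar_rows, growth_kb_per_hour) are made up for this example, it uses a subset of the rows shown above, and it assumes all samples fall within the same day.

```python
# Sketch: estimate how fast kbmemused grows from `sar -r` rows.
# Column order follows the sar header above: time, kbmemfree, kbmemused, ...
SAR_ROWS = """\
10:20:01 32077264 132754368 80.54 16176 3040244 77767024 47.18 51991692 2676468 260
10:40:01 32067244 132764388 80.55 16176 3059076 77832316 47.22 51983332 2694708 264
11:00:01 28927656 135903976 82.45 16176 3074064 78958568 47.90 55114284 2702892 12
11:20:01 26455556 138376076 83.95 16176 3080436 81991036 49.74 57570280 2708500 8
11:40:01 25965924 138865708 84.25 16176 3089708 83734584 50.80 58049980 2716740 12
"""


def parse_sar_rows(text):
    """Return a list of (seconds_since_midnight, kbmemused) tuples."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row, blank lines and elisions such as "...".
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        hours, minutes, seconds = (int(part) for part in fields[0].split(":"))
        samples.append((hours * 3600 + minutes * 60 + seconds, int(fields[2])))
    return samples


def growth_kb_per_hour(samples):
    """Average kbmemused growth between the first and last sample."""
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    return (used1 - used0) * 3600 / (t1 - t0)


samples = parse_sar_rows(SAR_ROWS)
print("kbmemused growth per hour:", growth_kb_per_hour(samples))
```

For the rows used here, kbmemused rises by 6111340 kB between 10:20:01 and 11:40:01, an average of about 4.58 million kB (roughly 4.4 GiB) per hour. The growth is not uniform: the trace is flat before 10:40 and again after 11:30.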

The attached graph shows memory utilisation by ceph-osd increasing steadily during the soak test. When usage approaches the system limit of 128GB RAM, we see the dmesg logs below reporting an out-of-memory condition. osd.3 was killed by the OOM killer and then restarted.

[Tue Feb 14 03:51:02 2017] tp_osd_tp invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[Tue Feb 14 03:51:02 2017] tp_osd_tp cpuset=/ mems_allowed=0-1
[Tue Feb 14 03:51:02 2017] CPU: 20 PID: 11864 Comm: tp_osd_tp Not tainted 3.10.0-327.13.1.el7.x86_64 #1
[Tue Feb 14 03:51:02 2017] Hardware name: HP ProLiant XL420 Gen9/ProLiant XL420 Gen9, BIOS U19 09/12/2016
[Tue Feb 14 03:51:02 2017]  ffff8819ccd7a280 0000000030e84036 ffff881fa58f7528 ffffffff816356f4
[Tue Feb 14 03:51:02 2017]  ffff881fa58f75b8 ffffffff8163068f ffff881fa3478360 ffff881fa3478378
[Tue Feb 14 03:51:02 2017]  ffff881fa58f75e8 ffff8819ccd7a280 0000000000000001 000000000001f65f
[Tue Feb 14 03:51:02 2017] Call Trace:
[Tue Feb 14 03:51:02 2017]  [<ffffffff816356f4>] dump_stack+0x19/0x1b
[Tue Feb 14 03:51:02 2017]  [<ffffffff8163068f>] dump_header+0x8e/0x214
[Tue Feb 14 03:51:02 2017]  [<ffffffff8116ce7e>] oom_kill_process+0x24e/0x3b0
[Tue Feb 14 03:51:02 2017]  [<ffffffff8116c9e6>] ? find_lock_task_mm+0x56/0xc0
[Tue Feb 14 03:51:02 2017]  [<ffffffff8116d6a6>] out_of_memory+0x4b6/0x4f0
[Tue Feb 14 03:51:02 2017]  [<ffffffff81173885>] __alloc_pages_nodemask+0xa95/0xb90
[Tue Feb 14 03:51:02 2017]  [<ffffffff811b792a>] alloc_pages_vma+0x9a/0x140
[Tue Feb 14 03:51:02 2017]  [<ffffffff811976c5>] handle_mm_fault+0xb85/0xf50
[Tue Feb 14 03:51:02 2017]  [<ffffffff811957fb>] ? follow_page_mask+0xbb/0x5c0
[Tue Feb 14 03:51:02 2017]  [<ffffffff81197c2b>] __get_user_pages+0x19b/0x640
[Tue Feb 14 03:51:02 2017]  [<ffffffff8119843d>] get_user_pages_unlocked+0x15d/0x1f0
[Tue Feb 14 03:51:02 2017]  [<ffffffff8106544f>] get_user_pages_fast+0x9f/0x1a0
[Tue Feb 14 03:51:02 2017]  [<ffffffff8121de78>] do_blockdev_direct_IO+0x1a78/0x2610
[Tue Feb 14 03:51:02 2017]  [<ffffffff81218c40>] ? I_BDEV+0x10/0x10
[Tue Feb 14 03:51:02 2017]  [<ffffffff8121ea65>] __blockdev_direct_IO+0x55/0x60
[Tue Feb 14 03:51:02 2017]  [<ffffffff81218c40>] ? I_BDEV+0x10/0x10
[Tue Feb 14 03:51:02 2017]  [<ffffffff81219297>] blkdev_direct_IO+0x57/0x60
[Tue Feb 14 03:51:02 2017]  [<ffffffff81218c40>] ? I_BDEV+0x10/0x10
[Tue Feb 14 03:51:02 2017]  [<ffffffff8116af63>] generic_file_aio_read+0x6d3/0x750
[Tue Feb 14 03:51:02 2017]  [<ffffffffa038ad5c>] ? xfs_iunlock+0x11c/0x130 [xfs]
[Tue Feb 14 03:51:02 2017]  [<ffffffff811690db>] ? unlock_page+0x2b/0x30
[Tue Feb 14 03:51:02 2017]  [<ffffffff81192f21>] ? __do_fault+0x401/0x510
[Tue Feb 14 03:51:02 2017]  [<ffffffff8121970c>] blkdev_aio_read+0x4c/0x70
[Tue Feb 14 03:51:02 2017]  [<ffffffff811ddcfd>] do_sync_read+0x8d/0xd0
[Tue Feb 14 03:51:02 2017]  [<ffffffff811de45c>] vfs_read+0x9c/0x170
[Tue Feb 14 03:51:02 2017]  [<ffffffff811df182>] SyS_pread64+0x92/0xc0
[Tue Feb 14 03:51:02 2017]  [<ffffffff81645e89>] system_call_fastpath+0x16/0x1b

Feb 14 03:51:40 fr-paris kernel: Out of memory: Kill process 7657 (ceph-osd) score 45 or sacrifice child
Feb 14 03:51:40 fr-paris kernel: Killed process 7657 (ceph-osd) total-vm:8650208kB, anon-rss:6124660kB, file-rss:1560kB
Feb 14 03:51:41 fr-paris systemd: ceph-osd@3.service: main process exited, code=killed, status=9/KILL
Feb 14 03:51:41 fr-paris systemd: Unit ceph-osd@3.service entered failed state.
Feb 14 03:51:41 fr-paris systemd: ceph-osd@3.service failed.
Feb 14 03:51:41 fr-paris systemd: cassandra.service: main process exited, code=killed, status=9/KILL
Feb 14 03:51:41 fr-paris systemd: Unit cassandra.service entered failed state.
Feb 14 03:51:41 fr-paris systemd: cassandra.service failed.
Feb 14 03:51:41 fr-paris ceph-mgr: 2017-02-14 03:51:41.978878 7f51a3154700 -1 mgr ms_dispatch osd_map(7517..7517 src has 6951..7517) v3
Feb 14 03:51:42 fr-paris systemd: Device dev-disk-by\x2dpartlabel-ceph\x5cx20block.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/target0:0:0/0:0:0:9/block/sdj/sdj2 and /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/target0:0:0/0:0:0:4/block/sde/sde2
Feb 14 03:51:42 fr-paris ceph-mgr: 2017-02-14 03:51:42.992477 7f51a3154700 -1 mgr ms_dispatch osd_map(7518..7518 src has 6951..7518) v3
Feb 14 03:51:43 fr-paris ceph-mgr: 2017-02-14 03:51:43.508990 7f51a3154700 -1 mgr ms_dispatch mgrdigest v1
Feb 14 03:51:48 fr-paris ceph-mgr: 2017-02-14 03:51:48.508970 7f51a3154700 -1 mgr ms_dispatch mgrdigest v1
Feb 14 03:51:53 fr-paris ceph-mgr: 2017-02-14 03:51:53.509592 7f51a3154700 -1 mgr ms_dispatch mgrdigest v1
Feb 14 03:51:58 fr-paris ceph-mgr: 2017-02-14 03:51:58.509936 7f51a3154700 -1 mgr ms_dispatch mgrdigest v1
Feb 14 03:52:01 fr-paris systemd: ceph-osd@3.service holdoff time over, scheduling restart.
Feb 14 03:52:02 fr-paris systemd: Starting Ceph object storage daemon osd.3...
Feb 14 03:52:02 fr-paris systemd: Started Ceph object storage daemon osd.3.
Feb 14 03:52:02 fr-paris numactl: 2017-02-14 03:52:02.307106 7f1e499bb940 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
Feb 14 03:52:02 fr-paris numactl: 2017-02-14 03:52:02.317687 7f1e499bb940 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
Feb 14 03:52:02 fr-paris numactl: starting osd.3 at - osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
Feb 14 03:52:02 fr-paris numactl: 2017-02-14 03:52:02.333522 7f1e499bb940 -1 WARNING: experimental feature 'bluestore' is enabled
Feb 14 03:52:02 fr-paris numactl: Please be aware that this feature is experimental, untested,
Feb 14 03:52:02 fr-paris numactl: unsupported, and may result in data corruption, data loss,
Feb 14 03:52:02 fr-paris numactl: and/or irreparable damage to your cluster.  Do not use
Feb 14 03:52:02 fr-paris numactl: feature with important data.

This seems to happen only in 11.2.0 and not in 11.1.x. Could you please help us resolve this issue, either with a config change that limits memory use by ceph-osd, or by confirming whether this is a bug in the current kraken release?
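To narrow down which OSDs are growing, per-process resident memory could be sampled periodically on each node. The sketch below is only one possible approach and is not Ceph tooling: it reads the VmRSS field from /proc/<pid>/status on Linux, and the function names (vmrss_kb, ceph_osd_rss_kb) are invented for this example.

```python
import os


def vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def ceph_osd_rss_kb(proc_root="/proc"):
    """Map pid -> resident set size in kB for processes named ceph-osd."""
    rss_by_pid = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                status = handle.read()
        except OSError:
            continue  # process exited or is not readable
        lines = status.splitlines()
        # The first line of the status file has the form "Name:\t<command>".
        if lines and lines[0].split()[:2] == ["Name:", "ceph-osd"]:
            rss = vmrss_kb(status)
            if rss is not None:
                rss_by_pid[int(entry)] = rss
    return rss_by_pid
```

Calling ceph_osd_rss_kb() at a fixed interval and logging the result per PID would show whether all OSDs grow at a similar rate or only a few of them do.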

Thanks,
Muthu
Attachment: kraken-memory-graph.png (PNG image)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com