Re: OSD killed by OOM when many cache available

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



One thing that doesn't show up is fs cache, which is likely the cause here. We went through this on our SSDs and had to add the following to stop the crashes. I believe vm.vfs_cache_pressure and min_free_kbytes were the really helpful things in getting the crashes to stop. HTH!

sysctl_param 'vm.vfs_cache_pressure' do

  value 400

end

sysctl_param 'vm.dirty_ratio' do

  value 20

end

sysctl_param 'vm.dirty_background_ratio' do

  value 2

end

sysctl_param 'vm.min_free_kbytes' do

  value 4194304

end



On Fri, Nov 17, 2017 at 4:24 PM, Sam Huracan <nowitzki.sammy@xxxxxxxxx> wrote:
I see some more logs about memory in syslog:

Nov 17 10:47:17 ceph1 kernel: [2810698.553749] Node 0 DMA free:14828kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov 17 10:47:17 ceph1 kernel: [2810698.553753] lowmem_reserve[]: 0 1830 32011 32011 32011
Nov 17 10:47:17 ceph1 kernel: [2810698.553755] Node 0 DMA32 free:134848kB min:3860kB low:4824kB high:5788kB active_anon:120628kB inactive_anon:121792kB active_file:653404kB inactive_file:399272kB unevictable:0kB isolated(anon):0kB isolated(file):36kB present:1981184kB managed:1900752kB mlocked:0kB dirty:344kB writeback:0kB mapped:1200kB shmem:0kB slab_reclaimable:239900kB slab_unreclaimable:154560kB kernel_stack:43376kB pagetables:1176kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 17 10:47:17 ceph1 kernel: [2810698.553758] lowmem_reserve[]: 0 0 30180 30180 30180
Nov 17 10:47:17 ceph1 kernel: [2810698.553760] Node 0 Normal free:237232kB min:63684kB low:79604kB high:95524kB active_anon:3075228kB inactive_anon:629052kB active_file:12544336kB inactive_file:12570716kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:31457280kB managed:30905084kB mlocked:0kB dirty:74368kB writeback:3796kB mapped:48516kB shmem:1276kB slab_reclaimable:713684kB slab_unreclaimable:416404kB kernel_stack:40896kB pagetables:25288kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 17 10:47:17 ceph1 kernel: [2810698.553763] lowmem_reserve[]: 0 0 0 0 0
Nov 17 10:47:17 ceph1 kernel: [2810698.553765] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 1*32kB (U) 3*64kB (U) 2*128kB (U) 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14828kB
Nov 17 10:47:17 ceph1 kernel: [2810698.553772] Node 0 DMA32: 10756*4kB (UME) 11442*8kB (UME) 25*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 134960kB
Nov 17 10:47:17 ceph1 kernel: [2810698.553777] Node 0 Normal: 59473*4kB (UME) 77*8kB (U) 10*16kB (H) 8*32kB (H) 6*64kB (H) 2*128kB (H) 1*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 240332kB
Nov 17 10:47:17 ceph1 kernel: [2810698.553784] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov 17 10:47:17 ceph1 kernel: [2810698.553784] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 17 10:47:17 ceph1 kernel: [2810698.553785] 6544612 total pagecache pages
Nov 17 10:47:17 ceph1 kernel: [2810698.553786] 2347 pages in swap cache
Nov 17 10:47:17 ceph1 kernel: [2810698.553787] Swap cache stats: add 2697126, delete 2694779, find 38874122/39241548
Nov 17 10:47:17 ceph1 kernel: [2810698.553788] Free swap  = 61498092kB
Nov 17 10:47:17 ceph1 kernel: [2810698.553788] Total swap = 62498812kB
Nov 17 10:47:17 ceph1 kernel: [2810698.553789] 8363615 pages RAM
Nov 17 10:47:17 ceph1 kernel: [2810698.553790] 0 pages HighMem/MovableOnly
Nov 17 10:47:17 ceph1 kernel: [2810698.553790] 158182 pages reserved
Nov 17 10:47:17 ceph1 kernel: [2810698.553791] 0 pages cma reserved

Is it relate to page caches?

2017-11-18 7:22 GMT+07:00 Sam Huracan <nowitzki.sammy@xxxxxxxxx>:
Today, one of our Ceph OSDs was down, I've check syslog and see this OSD process was killed by OMM


Nov 17 10:01:06 ceph1 kernel: [2807926.762304] Out of memory: Kill process 3330 (ceph-osd) score 7 or sacrifice child
Nov 17 10:01:06 ceph1 kernel: [2807926.763745] Killed process 3330 (ceph-osd) total-vm:2372392kB, anon-rss:559084kB, file-rss:7268kB
Nov 17 10:01:06 ceph1 kernel: [2807926.985830] init: ceph-osd (ceph/6) main process (3330) killed by KILL signal
Nov 17 10:01:06 ceph1 kernel: [2807926.985844] init: ceph-osd (ceph/6) main process ended, respawning
Nov 17 10:03:39 ceph1 bash: root [15524]: sudo ceph health detail [0]
Nov 17 10:03:40 ceph1 bash: root [15524]: sudo ceph health detail [0]
Nov 17 10:17:01 ceph1 CRON[75167]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Nov 17 10:47:17 ceph1 kernel: [2810698.553690] ceph-status.sh invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Nov 17 10:47:17 ceph1 kernel: [2810698.553693] ceph-status.sh cpuset=/ mems_allowed=0
Nov 17 10:47:17 ceph1 kernel: [2810698.553697] CPU: 5 PID: 194271 Comm: ceph-status.sh Not tainted 4.4.0-62-generic #83~14.04.1-Ubuntu
Nov 17 10:47:17 ceph1 kernel: [2810698.553698] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
Nov 17 10:47:17 ceph1 kernel: [2810698.553699]  0000000000000000 ffff88001a857b38 ffffffff813dc4ac ffff88001a857cf0
Nov 17 10:47:17 ceph1 kernel: [2810698.553701]  0000000000000000 ffff88001a857bc8 ffffffff811fb066 ffff88083d29e200
Nov 17 10:47:17 ceph1 kernel: [2810698.553703]  ffff88001a857cf0 ffff88001a857c00 ffff8808453cf000 0000000000000000
Nov 17 10:47:17 ceph1 kernel: [2810698.553704] Call Trace:
Nov 17 10:47:17 ceph1 kernel: [2810698.553709]  [<ffffffff813dc4ac>] dump_stack+0x63/0x87
Nov 17 10:47:17 ceph1 kernel: [2810698.553713]  [<ffffffff811fb066>] dump_header+0x5b/0x1d5
Nov 17 10:47:17 ceph1 kernel: [2810698.553717]  [<ffffffff81188b75>] oom_kill_process+0x205/0x3d0
Nov 17 10:47:17 ceph1 kernel: [2810698.553718]  [<ffffffff8118857e>] ? oom_unkillable_task+0x9e/0xc0
Nov 17 10:47:17 ceph1 kernel: [2810698.553720]  [<ffffffff811891ab>] out_of_memory+0x40b/0x460
Nov 17 10:47:17 ceph1 kernel: [2810698.553722]  [<ffffffff811fbb1f>] __alloc_pages_slowpath.constprop.87+0x742/0x7ad
Nov 17 10:47:17 ceph1 kernel: [2810698.553725]  [<ffffffff8118e1a7>] __alloc_pages_nodemask+0x237/0x240
Nov 17 10:47:17 ceph1 kernel: [2810698.553727]  [<ffffffff8118e36d>] alloc_kmem_pages_node+0x4d/0xd0
Nov 17 10:47:17 ceph1 kernel: [2810698.553730]  [<ffffffff8107c125>] copy_process+0x185/0x1ce0
Nov 17 10:47:17 ceph1 kernel: [2810698.553732]  [<ffffffff813320f3>] ? security_file_alloc+0x33/0x50
Nov 17 10:47:17 ceph1 kernel: [2810698.553734]  [<ffffffff8107de1a>] _do_fork+0x8a/0x310
Nov 17 10:47:17 ceph1 kernel: [2810698.553737]  [<ffffffff8108dc01>] ? sigprocmask+0x51/0x80
Nov 17 10:47:17 ceph1 kernel: [2810698.553738]  [<ffffffff8107e149>] SyS_clone+0x19/0x20
Nov 17 10:47:17 ceph1 kernel: [2810698.553743]  [<ffffffff81802d36>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 17 10:47:17 ceph1 kernel: [2810698.553744] Mem-Info:
Nov 17 10:47:17 ceph1 kernel: [2810698.553747] active_anon:798964 inactive_anon:187711 isolated_anon:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  active_file:3299435 inactive_file:3242497 isolated_file:9
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  unevictable:0 dirty:18678 writeback:949 unstable:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  slab_reclaimable:238396 slab_unreclaimable:142741
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  mapped:12429 shmem:319 pagetables:6616 bounce:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  free:96727 free_pcp:0 free_cma:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553749] Node 0 DMA free:14828kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov 17 10:47:17 ceph1 kernel: [2810698.553753] lowmem_reserve[]: 0 1830 32011 32011 32011
Nov 17 10:47:17 ceph1 kernel: [2810698.553755] Node 0 DMA32 free:134848kB min:3860kB low:4824kB high:5788kB active_anon:120628kB inactive_anon:121792kB active_file:653404kB inactive_file:399272kB unevictable:0kB isolated(anon):0kB isolated(file):36kB present:1981184kB managed:1900752kB mlocked:0kB dirty:344kB writeback:0kB mapped:1200kB shmem:0kB slab_reclaimable:239900kB slab_unreclaimable:154560kB kernel_stack:43376kB pagetables:1176kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no


We check free Memory, Cache is free 24,705 MB

root@ceph1:/var/log# free -m
             total       used       free     shared    buffers     cached
Mem:         32052      31729        323          1        125      24256
-/+ buffers/cache:       7347      24705
Swap:        61033        120      60913



It's so weird. Could you help me solve this problem, I'm afraid it 'll come again.

Thanks in advance.





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux