@Eric: How can I check status of fscache? Why can it be root cause?
Thanks
2017-11-18 7:30 GMT+07:00 Eric Nelson <ericnelson@xxxxxxxxx>:
One thing that doesn't show up is fs cache, which is likely the cause here. We went through this on our SSDs and had to add the following to stop the crashes. I believe vm.vfs_cache_pressure and min_free_kbytes were the really helpful things in getting the crashes to stop. HTH!sysctl_param 'vm.vfs_cache_pressure' do
value 400
end
sysctl_param 'vm.dirty_ratio' do
value 20
end
sysctl_param 'vm.dirty_background_ratio' do
value 2
end
sysctl_param 'vm.min_free_kbytes' do
value 4194304
end
On Fri, Nov 17, 2017 at 4:24 PM, Sam Huracan <nowitzki.sammy@xxxxxxxxx> wrote:______________________________I see some more logs about memory in syslog:Nov 17 10:47:17 ceph1 kernel: [2810698.553749] Node 0 DMA free:14828kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yesNov 17 10:47:17 ceph1 kernel: [2810698.553753] lowmem_reserve[]: 0 1830 32011 32011 32011Nov 17 10:47:17 ceph1 kernel: [2810698.553755] Node 0 DMA32 free:134848kB min:3860kB low:4824kB high:5788kB active_anon:120628kB inactive_anon:121792kB active_file:653404kB inactive_file:399272kB unevictable:0kB isolated(anon):0kB isolated(file):36kB present:1981184kB managed:1900752kB mlocked:0kB dirty:344kB writeback:0kB mapped:1200kB shmem:0kB slab_reclaimable:239900kB slab_unreclaimable:154560kB kernel_stack:43376kB pagetables:1176kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? noNov 17 10:47:17 ceph1 kernel: [2810698.553758] lowmem_reserve[]: 0 0 30180 30180 30180Nov 17 10:47:17 ceph1 kernel: [2810698.553760] Node 0 Normal free:237232kB min:63684kB low:79604kB high:95524kB active_anon:3075228kB inactive_anon:629052kB active_file:12544336kB inactive_file:12570716kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:31457280kB managed:30905084kB mlocked:0kB dirty:74368kB writeback:3796kB mapped:48516kB shmem:1276kB slab_reclaimable:713684kB slab_unreclaimable:416404kB kernel_stack:40896kB pagetables:25288kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? noNov 17 10:47:17 ceph1 kernel: [2810698.553763] lowmem_reserve[]: 0 0 0 0 0Nov 17 10:47:17 ceph1 kernel: [2810698.553765] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 1*32kB (U) 3*64kB (U) 2*128kB (U) 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14828kBNov 17 10:47:17 ceph1 kernel: [2810698.553772] Node 0 DMA32: 10756*4kB (UME) 11442*8kB (UME) 25*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 134960kBNov 17 10:47:17 ceph1 kernel: [2810698.553777] Node 0 Normal: 59473*4kB (UME) 77*8kB (U) 10*16kB (H) 8*32kB (H) 6*64kB (H) 2*128kB (H) 1*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 240332kBNov 17 10:47:17 ceph1 kernel: [2810698.553784] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kBNov 17 10:47:17 ceph1 kernel: [2810698.553784] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kBNov 17 10:47:17 ceph1 kernel: [2810698.553785] 6544612 total pagecache pagesNov 17 10:47:17 ceph1 kernel: [2810698.553786] 2347 pages in swap cacheNov 17 10:47:17 ceph1 kernel: [2810698.553787] Swap cache stats: add 2697126, delete 2694779, find 38874122/39241548Nov 17 10:47:17 ceph1 kernel: [2810698.553788] Free swap = 61498092kBNov 17 10:47:17 ceph1 kernel: [2810698.553788] Total swap = 62498812kBNov 17 10:47:17 ceph1 kernel: [2810698.553789] 8363615 pages RAMNov 17 10:47:17 ceph1 kernel: [2810698.553790] 0 pages HighMem/MovableOnlyNov 17 10:47:17 ceph1 kernel: [2810698.553790] 158182 pages reservedNov 17 10:47:17 ceph1 kernel: [2810698.553791] 0 pages cma reservedIs it relate to page caches?2017-11-18 7:22 GMT+07:00 Sam Huracan <nowitzki.sammy@xxxxxxxxx>:Today, one of our Ceph OSDs was down, I've check syslog and see this OSD process was killed by OMMNov 17 10:01:06 ceph1 kernel: [2807926.762304] Out of memory: Kill process 3330 (ceph-osd) score 7 or sacrifice childNov 17 10:01:06 ceph1 kernel: [2807926.763745] Killed process 3330 (ceph-osd) total-vm:2372392kB, anon-rss:559084kB, file-rss:7268kBNov 17 10:01:06 ceph1 kernel: [2807926.985830] init: ceph-osd (ceph/6) main process (3330) killed by KILL signalNov 17 10:01:06 ceph1 kernel: [2807926.985844] init: ceph-osd (ceph/6) main process ended, respawningNov 17 10:03:39 ceph1 bash: root [15524]: sudo ceph health detail [0]Nov 17 10:03:40 ceph1 bash: root [15524]: sudo ceph health detail [0]Nov 17 10:17:01 ceph1 CRON[75167]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)Nov 17 10:47:17 ceph1 kernel: [2810698.553690] ceph-status.sh invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0Nov 17 10:47:17 ceph1 kernel: [2810698.553693] ceph-status.sh cpuset=/ mems_allowed=0Nov 17 10:47:17 ceph1 kernel: [2810698.553697] CPU: 5 PID: 194271 Comm: ceph-status.sh Not tainted 4.4.0-62-generic #83~14.04.1-UbuntuNov 17 10:47:17 ceph1 kernel: [2810698.553698] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017Nov 17 10:47:17 ceph1 kernel: [2810698.553699] 0000000000000000 ffff88001a857b38 ffffffff813dc4ac ffff88001a857cf0Nov 17 10:47:17 ceph1 kernel: [2810698.553701] 0000000000000000 ffff88001a857bc8 ffffffff811fb066 ffff88083d29e200Nov 17 10:47:17 ceph1 kernel: [2810698.553703] ffff88001a857cf0 ffff88001a857c00 ffff8808453cf000 0000000000000000Nov 17 10:47:17 ceph1 kernel: [2810698.553704] Call Trace:Nov 17 10:47:17 ceph1 kernel: [2810698.553709] [<ffffffff813dc4ac>] dump_stack+0x63/0x87Nov 17 10:47:17 ceph1 kernel: [2810698.553713] [<ffffffff811fb066>] dump_header+0x5b/0x1d5Nov 17 10:47:17 ceph1 kernel: [2810698.553717] [<ffffffff81188b75>] oom_kill_process+0x205/0x3d0Nov 17 10:47:17 ceph1 kernel: [2810698.553718] [<ffffffff8118857e>] ? oom_unkillable_task+0x9e/0xc0Nov 17 10:47:17 ceph1 kernel: [2810698.553720] [<ffffffff811891ab>] out_of_memory+0x40b/0x460Nov 17 10:47:17 ceph1 kernel: [2810698.553722] [<ffffffff811fbb1f>] __alloc_pages_slowpath.constprop.87+0x742/0x7ad Nov 17 10:47:17 ceph1 kernel: [2810698.553725] [<ffffffff8118e1a7>] __alloc_pages_nodemask+0x237/0x240 Nov 17 10:47:17 ceph1 kernel: [2810698.553727] [<ffffffff8118e36d>] alloc_kmem_pages_node+0x4d/0xd0 Nov 17 10:47:17 ceph1 kernel: [2810698.553730] [<ffffffff8107c125>] copy_process+0x185/0x1ce0Nov 17 10:47:17 ceph1 kernel: [2810698.553732] [<ffffffff813320f3>] ? security_file_alloc+0x33/0x50Nov 17 10:47:17 ceph1 kernel: [2810698.553734] [<ffffffff8107de1a>] _do_fork+0x8a/0x310Nov 17 10:47:17 ceph1 kernel: [2810698.553737] [<ffffffff8108dc01>] ? sigprocmask+0x51/0x80Nov 17 10:47:17 ceph1 kernel: [2810698.553738] [<ffffffff8107e149>] SyS_clone+0x19/0x20Nov 17 10:47:17 ceph1 kernel: [2810698.553743] [<ffffffff81802d36>] entry_SYSCALL_64_fastpath+0x16/0x75 Nov 17 10:47:17 ceph1 kernel: [2810698.553744] Mem-Info:Nov 17 10:47:17 ceph1 kernel: [2810698.553747] active_anon:798964 inactive_anon:187711 isolated_anon:0Nov 17 10:47:17 ceph1 kernel: [2810698.553747] active_file:3299435 inactive_file:3242497 isolated_file:9Nov 17 10:47:17 ceph1 kernel: [2810698.553747] unevictable:0 dirty:18678 writeback:949 unstable:0Nov 17 10:47:17 ceph1 kernel: [2810698.553747] slab_reclaimable:238396 slab_unreclaimable:142741Nov 17 10:47:17 ceph1 kernel: [2810698.553747] mapped:12429 shmem:319 pagetables:6616 bounce:0Nov 17 10:47:17 ceph1 kernel: [2810698.553747] free:96727 free_pcp:0 free_cma:0Nov 17 10:47:17 ceph1 kernel: [2810698.553749] Node 0 DMA free:14828kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yesNov 17 10:47:17 ceph1 kernel: [2810698.553753] lowmem_reserve[]: 0 1830 32011 32011 32011Nov 17 10:47:17 ceph1 kernel: [2810698.553755] Node 0 DMA32 free:134848kB min:3860kB low:4824kB high:5788kB active_anon:120628kB inactive_anon:121792kB active_file:653404kB inactive_file:399272kB unevictable:0kB isolated(anon):0kB isolated(file):36kB present:1981184kB managed:1900752kB mlocked:0kB dirty:344kB writeback:0kB mapped:1200kB shmem:0kB slab_reclaimable:239900kB slab_unreclaimable:154560kB kernel_stack:43376kB pagetables:1176kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? noWe check free Memory, Cache is free 24,705 MBroot@ceph1:/var/log# free -mtotal used free shared buffers cachedMem: 32052 31729 323 1 125 24256-/+ buffers/cache: 7347 24705Swap: 61033 120 60913It's so weird. Could you help me solve this problem, I'm afraid it 'll come again.Thanks in advance._________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com