We have been error-free for almost 3 weeks now. The following settings were changed on all OSD nodes:
vm.swappiness=1
vm.min_free_kbytes=262144
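For anyone wanting to confirm that a node actually carries these values, here is a minimal sketch that reads them back from procfs. It assumes the standard /proc/sys/vm layout; the names TARGETS, read_vm_setting and find_mismatches are made up for illustration and are not part of any Ceph or kernel tooling. Note that 262144 kB is the 256M suggested in the quoted message (256 * 1024).

```python
from pathlib import Path

# Targets taken from the two settings listed above.
# min_free_kbytes is expressed in kB: 256 MiB * 1024 = 262144 kB.
TARGETS = {"swappiness": 1, "min_free_kbytes": 256 * 1024}


def read_vm_setting(name, root="/proc/sys/vm"):
    """Return the integer value of a vm.* sysctl by reading its procfs file."""
    return int(Path(root, name).read_text().strip())


def find_mismatches(targets=TARGETS, root="/proc/sys/vm"):
    """Return {name: (current, wanted)} for every setting that differs."""
    mismatches = {}
    for name, wanted in targets.items():
        current = read_vm_setting(name, root)
        if current != wanted:
            mismatches[name] = (current, wanted)
    return mismatches


if __name__ == "__main__":
    # Prints nothing when every setting already matches its target.
    for name, (current, wanted) in find_mismatches().items():
        print(f"vm.{name}: current={current}, wanted={wanted}")
```

Run on each OSD node, it prints only the settings that differ from the targets. The root argument exists only so the check can be pointed at a test directory instead of the live /proc.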
My discussion on the XFS list is here: http://www.spinics.net/lists/xfs/msg33645.html
Thanks,
Alex
On Fri, Jul 3, 2015 at 6:27 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
What’s the value of /proc/sys/vm/min_free_kbytes on your system? Increase it to 256M (better do it if there’s lots of free memory) and see if it helps.
It can also be set too high, hard to find any formula how to set it correctly...
Jan

On 03 Jul 2015, at 10:16, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:

Hello, we are experiencing severe OSD timeouts, OSDs are not taken out and we see the following in syslog on Ubuntu 14.04.2 with Firefly 0.80.9.
Thank you for any advice.
Alex

Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261899] BUG: unable to handle kernel paging request at 000000190000001c
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261923] IP: [<ffffffff8118e476>] find_get_entries+0x66/0x160
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261941] PGD 1035954067 PUD 0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261955] Oops: 0000 [#1] SMP
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261969] Modules linked in: xfs libcrc32c ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac edac_core lpc_ich joydev mei_me mei ioatdma wmi 8021q ipmi_si garp 8250_fintek mrp ipmi_msghandler stp llc bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_generic usbhid hid igb ahci mpt2sas mlx4_core i2c_algo_bit libahci dca raid_class ptp scsi_transport_sas pps_core arcmsr
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262182] CPU: 10 PID: 8711 Comm: ceph-osd Not tainted 4.1.0-040100-generic #201506220235
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262197] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262215] task: ffff8800721f1420 ti: ffff880fbad54000 task.ti: ffff880fbad54000
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262229] RIP: 0010:[<ffffffff8118e476>] [<ffffffff8118e476>] find_get_entries+0x66/0x160
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262248] RSP: 0018:ffff880fbad571a8 EFLAGS: 00010246
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262258] RAX: ffff880004000158 RBX: 000000000000000e RCX: 0000000000000000
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262303] RDX: ffff880004000158 RSI: ffff880fbad571c0 RDI: 0000001900000000
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262347] RBP: ffff880fbad57208 R08: 00000000000000c0 R09: 00000000000000ff
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262391] R10: 0000000000000000 R11: 0000000000000220 R12: 00000000000000b6
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262435] R13: ffff880fbad57268 R14: 000000000000000a R15: ffff880fbad572d8
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262479] FS: 00007f98cb0e0700(0000) GS:ffff88103f480000(0000) knlGS:0000000000000000
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262524] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262551] CR2: 000000190000001c CR3: 0000001034f0e000 CR4: 00000000000407e0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262596] Stack:
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262618] ffff880fbad571f8 ffff880cf6076b30 ffff880bdde05da8 00000000000000e6
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262669] 0000000000000100 ffff880cf6076b28 00000000000000b5 ffff880fbad57258
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262721] ffff880fbad57258 ffff880fbad572d8 ffffffffffffffff ffff880cf6076b28
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262772] Call Trace:
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262801] [<ffffffff8119b482>] pagevec_lookup_entries+0x22/0x30
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262831] [<ffffffff8119bd84>] truncate_inode_pages_range+0xf4/0x700
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262862] [<ffffffff8119c415>] truncate_inode_pages+0x15/0x20
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262891] [<ffffffff8119c53f>] truncate_inode_pages_final+0x5f/0xa0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262949] [<ffffffffc0431c2c>] xfs_fs_evict_inode+0x3c/0xe0 [xfs]
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262981] [<ffffffff81220558>] evict+0xb8/0x190
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263009] [<ffffffff81220671>] dispose_list+0x41/0x50
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263037] [<ffffffff8122176f>] prune_icache_sb+0x4f/0x60
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263067] [<ffffffff81208ab5>] super_cache_scan+0x155/0x1a0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263096] [<ffffffff8119d26f>] do_shrink_slab+0x13f/0x2c0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263126] [<ffffffff811a22b0>] ? shrink_lruvec+0x330/0x370
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263157] [<ffffffff811b4189>] ? isolate_migratepages_block+0x299/0x5c0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263188] [<ffffffff8119d558>] shrink_slab+0xd8/0x110
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263217] [<ffffffff811a25bf>] shrink_zone+0x2cf/0x300
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263246] [<ffffffff811b4d3d>] ? compact_zone+0x7d/0x4f0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263275] [<ffffffff811a2a64>] shrink_zones+0x104/0x2a0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263304] [<ffffffff811b53ad>] ? compact_zone_order+0x5d/0x70
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263336] [<ffffffff810f1666>] ? ktime_get+0x46/0xb0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263365] [<ffffffff811a2cd7>] do_try_to_free_pages+0xd7/0x160
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263396] [<ffffffff811a3017>] try_to_free_pages+0xb7/0x170
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263427] [<ffffffff8119571a>] __alloc_pages_nodemask+0x5ba/0x9c0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263460] [<ffffffff811dc9bc>] alloc_pages_current+0x9c/0x110
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263492] [<ffffffff811e4f2a>] allocate_slab+0x20a/0x2e0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263522] [<ffffffff811e5031>] new_slab+0x31/0x1f0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263553] [<ffffffff817f8dd9>] __slab_alloc+0x18e/0x2a3
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263584] [<ffffffff816d7817>] ? __alloc_skb+0x87/0x2b0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263614] [<ffffffff816d77e7>] ? __alloc_skb+0x57/0x2b0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263643] [<ffffffff811e9b7b>] __kmalloc_node_track_caller+0xbb/0x2b0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263675] [<ffffffff816d7817>] ? __alloc_skb+0x87/0x2b0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263704] [<ffffffff816d737c>] __kmalloc_reserve.isra.57+0x3c/0xa0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263734] [<ffffffff816d7817>] __alloc_skb+0x87/0x2b0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263766] [<ffffffff81737de1>] sk_stream_alloc_skb+0x41/0x130
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263796] [<ffffffff817388b3>] tcp_sendmsg+0x2d3/0xa90
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263827] [<ffffffff81764477>] inet_sendmsg+0x67/0xa0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263858] [<ffffffff816cea54>] ? copy_msghdr_from_user+0x154/0x1b0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263891] [<ffffffff816cdcfd>] sock_sendmsg+0x4d/0x60
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263920] [<ffffffff816cef93>] ___sys_sendmsg+0x2b3/0x2c0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263950] [<ffffffff810a853c>] ? ttwu_do_wakeup+0x2c/0x100
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263979] [<ffffffff810a8826>] ? ttwu_do_activate.constprop.121+0x66/0x70
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264011] [<ffffffff810abef5>] ? try_to_wake_up+0x215/0x2a0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264040] [<ffffffff810abfb0>] ? wake_up_state+0x10/0x20
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264071] [<ffffffff810fce86>] ? wake_futex+0x76/0xb0
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264099] [<ffffffff810fe192>] ? futex_wake+0x72/0x140
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264127] [<ffffffff81222675>] ? __fget_light+0x25/0x70
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264155] [<ffffffff816cf9b9>] __sys_sendmsg+0x49/0x90
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264184] [<ffffffff816cfa19>] SyS_sendmsg+0x19/0x20
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264215] [<ffffffff8180d272>] system_call_fastpath+0x16/0x75
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264243] Code: 00 4c 89 65 c0 31 d2 e9 86 00 00 00 66 0f 1f 84 00 00 00 00 00 48 8b 3a 48 85 ff 0f 84 ad 00 00 00 40 f6 c7 03 0f 85 a9 00 00 00 <8b> 4f 1c 85 c9 74 e3 8d 71 01 4c 8d 47 1c 89 c8 f0 0f b1 77 1c
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264467] RIP [<ffffffff8118e476>] find_get_entries+0x66/0x160
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264499] RSP <ffff880fbad571a8>
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264522] CR2: 000000190000001c
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.264824] ---[ end trace ae271fe24c8d817e ]---
Jul 3 03:45:01 roc-4r-sca020 CRON[801140]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 2 06:28:21 roc-4r-sca020 rsyslogd: message repeated 6 times: [ [origin software="rsyslogd" swVersion="7.4.4" x-pid="722" x-info="http://www.rsyslog.com"] rsyslogd was HUPed]
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com