Hello, I found a bug that might be related.
https://patchwork.kernel.org/patch/9339843/
On several of my crash dumps, I saw it die in
"memcg_drain_all_list_lrus" and found this mentioned in a bug where FUSE
was calling functions to modify the radix tree directly in a way that
other kernel developers weren't expecting. (I don't fully understand
the details)
Our systems that are crashing are not using the FUSE driver for
anything, but the crashes still coincide with nilfs2 activity. Is there
a chance nilfs is affected by this same situation?
I've attached another example crash log. We're still on 4.1.31, but
will move to 4.1.35 soon.
-Mike
On 9/28/2016 4:59 PM, Michael Conrad wrote:
Hi, I've started getting "kernel bug" messages on a few systems. At
first I wasn't sure if it was related to faulty hardware or not. The
kernel stack traces do not always mention nilfs, but they roughly
relate to the times when we make nilfs snapshots of our system (rsync
from ext4 into nilfs).
We use the same kernel image on several dozen systems, and aside from
one server, there have only been two crashes like this in the past two
months. (that "one server" though has crashed about four or five
times, which is why I suspected hardware at first)
However I just had one happen on a Linode, which is pretty reliable
hardware, so I figured I'd post and see what people think. The end of
the kernel log is attached.
I'm worried that something is corrupting kernel memory, and then
causing crashes in un-related parts of the kernel. I'm really not
sure how to narrow it down other than turning off the nilfs snapshots
and see if it continues to happen, though then i have to come up with
another backup solution in the mean time.
It's worth noting that the server crashing frequently also has the
largest tree of files. Also, the nilfs filesystems on these systems
date back to various versions of nilfs from the 3.* kernel line. It's
possible that an old bug is lurking on the filesystem structure, but I
don't believe nilfs has a check tool yet, correct?
Thanks,
Michael Conrad
[ 1907.431921] vlan1: port 3(tap-pcm0) entered forwarding state
[ 1978.778732] perf interrupt took too long (2509 > 2495), lowering kernel.perf_event_max_sample_rate to 50100
[ 3467.694580] NILFS warning: mounting unchecked fs
[ 3469.231826] NILFS: recovery complete.
[ 3469.235374] segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
[3464133.533497] perf interrupt took too long (5088 > 4960), lowering kernel.perf_event_max_sample_rate to 25200
[5185591.358403] ------------[ cut here ]------------
[5185591.358520] Kernel BUG at ffffffff802d1212 [verbose debug info unavailable]
[5185591.358613] invalid opcode: 0000 [#1] SMP
[5185591.358760] Modules linked in: twofish_generic twofish_common camellia_generic serpent_generic blowfish_generic blowfish_common xcbc sha512_generic coretemp asix r8169 8139too natsemi tg3 bn
x2 3c59x tulip e100 e1000 e1000e vmxnet3
[5185591.359655] CPU: 1 PID: 949 Comm: kswapd0 Not tainted 4.1.22 #13
[5185591.359744] Hardware name: Dell Inc. OptiPlex 745 /0MM599, BIOS 2.4.1 08/21/2007
[5185591.359884] task: ffff880126802d40 ti: ffff880121070000 task.ti: ffff880121070000
[5185591.360028] RIP: 0010:[<ffffffff802d1212>] [<ffffffff802d1212>] shadow_lru_isolate+0x5a/0x146
[5185591.360209] RSP: 0000:ffff880121073c38 EFLAGS: 00210002
[5185591.360300] RAX: 0000000000000000 RBX: ffff880068708b60 RCX: 000000000000faf8
[5185591.360440] RDX: 0000000000000104 RSI: ffff880068708b60 RDI: ffff8801269bafc0
[5185591.360585] RBP: ffff880052a83bb8 R08: 0000000000000000 R09: ffff880121073d00
[5185591.360740] R10: fffffffffffffff9 R11: 0000000000018f00 R12: ffff8801269bafc0
[5185591.360879] R13: ffff880052a83bd0 R14: ffff8801269bafc8 R15: ffff88007ac10920
[5185591.361054] FS: 0000000000000000(0000) GS:ffff88012bc40000(0000) knlGS:0000000000000000
[5185591.361191] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[5185591.361287] CR2: 00000000f77d7000 CR3: 00000000a7672000 CR4: 00000000000006e0
[5185591.361418] Stack:
[5185591.361498] ffff8801269bafc0 ffff880068708b60 ffff8801269bafc8 000000000000003d
[5185591.361533] ffffffff802d11b8 ffffffff802d0c45 0000000000000000 ffff880121073d00
[5185591.361533] 00000000ffffffff 0000000000000082 0000000000000000 0000000000006383
[5185591.361533] Call Trace:
[5185591.361533] [<ffffffff802d11b8>] ? memcg_drain_all_list_lrus+0x11c/0x11c
[5185591.361533] [<ffffffff802d0c45>] ? __list_lru_walk_one.isra.5+0x8a/0x114
[5185591.361533] [<ffffffff802d132a>] ? scan_shadow_nodes+0x2c/0x3b
[5185591.361533] [<ffffffff802c443f>] ? shrink_slab+0x1b4/0x22a
[5185591.361533] [<ffffffff802c6c95>] ? shrink_zone+0x168/0x1b7
[5185591.361533] [<ffffffff802c76be>] ? kswapd+0x592/0x6fc
[5185591.361533] [<ffffffff802c712c>] ? mem_cgroup_shrink_node_zone+0xa9/0xa9
[5185591.361533] [<ffffffff8025afd1>] ? kthread+0xc5/0xcd
[5185591.361533] [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5185591.361533] [<ffffffff80d6f422>] ? ret_from_fork+0x42/0x70
[5185591.361533] [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5185591.361533] Code: da a9 00 e9 e9 00 00 00 48 89 de 4c 89 f7 e8 3e f7 ff ff 4c 89 e7 e8 6c da a9 00 8b 53 ec 85 d2 75 02 0f 0b 31 c0 80 e2 7f 74 15 <0f> 0b 48 ff ca 48 89 55 58 48 83 c0 08 48
3d 00 02 00 00 74 35
[5185591.361533] RIP [<ffffffff802d1212>] shadow_lru_isolate+0x5a/0x146
[5185591.361533] RSP <ffff880121073c38>
[5185591.361533] ---[ end trace 7508b90725dc8b72 ]---
[5187529.914580] INFO: rcu_sched detected stalls on CPUs/tasks:
[5187529.914580] 1: (0 ticks this GP) idle=c99/140000000000000/0 softirq=695513032/695513032 fqs=8996
[5187529.914580] (detected by 2, t=18002 jiffies, g=281474919, c=281474918, q=511)
[5187529.914580] Task dump for CPU 1:
[5187529.914580] segctord R running task 0 23362 2 0x00000008
[5187529.914580] ffffffff80d6ab5c 0000000000000002 ffff880044d2c000 ffff88012bc55e00
[5187529.914580] 7fffffffffffffff ffffffff8054415d 0000000000000002 ffff880024c33ef8
[5187529.914580] ffffffff802b9a59 0000000000000052 0000000000244329 0000000000000007
[5187529.914580] Call Trace:
[5187529.914580] [<ffffffff80d6ab5c>] ? __schedule+0x4a2/0x6ba
[5187529.914580] [<ffffffff8054415d>] ? radix_tree_lookup_slot+0x10/0x24
[5187529.914580] [<ffffffff802b9a59>] ? find_get_entry+0x15/0x63
[5187529.914580] [<ffffffff802b9b6d>] ? pagecache_get_page+0x74/0x15c
[5187529.914580] [<ffffffff80458a81>] ? nilfs_grab_buffer+0xa2/0xd9
[5187529.914580] [<ffffffff8045961e>] ? nilfs_mdt_submit_block+0x114/0x134
[5187529.914580] [<ffffffff80459660>] ? nilfs_mdt_read_block+0x22/0xd5
[5187529.914580] [<ffffffff804597da>] ? nilfs_mdt_get_block+0x46/0x1d6
[5187529.914580] [<ffffffff804686b5>] ? nilfs_palloc_get_block+0xa8/0xb9
[5187529.914580] [<ffffffff80468784>] ? nilfs_palloc_get_bitmap_block+0x3d/0x44
[5187529.914580] [<ffffffff804685ef>] ? nilfs_palloc_group_desc_add_entries+0x2d/0x4b
[5187529.914580] [<ffffffff80468c50>] ? nilfs_palloc_prepare_alloc_entry+0x2b5/0x348
[5187529.914580] [<ffffffff804686b5>] ? nilfs_palloc_get_block+0xa8/0xb9
[5187529.914580] [<ffffffff8046888e>] ? nilfs_palloc_get_entry_block+0x65/0x6e
[5187529.914580] [<ffffffff8045a344>] ? nilfs_btnode_prepare_change_key+0x88/0x14d
[5187529.914580] [<ffffffff8045b230>] ? nilfs_btree_prepare_update_v+0xa0/0xc5
[5187529.914580] [<ffffffff80462855>] ? nilfs_collect_dat_data+0x4a/0x4a
[5187529.914580] [<ffffffff8045c870>] ? nilfs_btree_propagate+0x190/0x2c4
[5187529.914580] [<ffffffff80462855>] ? nilfs_collect_dat_data+0x4a/0x4a
[5187529.914580] [<ffffffff8045aa15>] ? nilfs_bmap_propagate+0x2b/0x51
[5187529.914580] [<ffffffff80462872>] ? nilfs_collect_file_data+0x1d/0x4a
[5187529.914580] [<ffffffff804618c4>] ? nilfs_segctor_apply_buffers+0x5a/0xc0
[5187529.914580] [<ffffffff80462698>] ? nilfs_segctor_scan_file+0xb6/0x139
[5187529.914580] [<ffffffff80462f1c>] ? nilfs_segctor_do_construct+0x5f9/0x1808
[5187529.914580] [<ffffffff804642db>] ? nilfs_segctor_construct+0x8f/0x216
[5187529.914580] [<ffffffff804645b8>] ? nilfs_segctor_thread+0x156/0x2fd
[5187529.914580] [<ffffffff80464462>] ? nilfs_segctor_construct+0x216/0x216
[5187529.914580] [<ffffffff8025afd1>] ? kthread+0xc5/0xcd
[5187529.914580] [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5187529.914580] [<ffffffff80d6f422>] ? ret_from_fork+0x42/0x70
[5187529.914580] [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5187571.161942] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1