Re: Kernel Bug on Linux 4.1.31, possibly nilfs, not sure

Michael Conrad <mconrad@xxxxxxxxxxxxxxx> · Mon, 28 Nov 2016 13:02:10 -0500

Hello, I found a bug that might be related.

https://patchwork.kernel.org/patch/9339843/

On several of my crash dumps, I saw it die in 
"memcg_drain_all_list_lrus" and found this mentioned in a bug where FUSE 
was calling functions to modify the radix tree directly in a way that 
other kernel developers weren't expecting.  (I don't fully understand 
the details)

Our systems that are crashing are not using the FUSE driver for 
anything, but the crashes still coincide with nilfs2 activity.  Is there 
a chance nilfs is affected by this same situation?

I've attached another example crash log.  We're still on 4.1.31, but 
will move to 4.1.35 soon.

-Mike

On 9/28/2016 4:59 PM, Michael Conrad wrote:
Hi, I've started getting "kernel bug" messages on a few systems. At 
first I wasn't sure if it was related to faulty hardware or not.  The 
kernel stack traces do not always mention nilfs, but they roughly 
relate to the times when we make nilfs snapshots of our system (rsync 
from ext4 into nilfs).

We use the same kernel image on several dozen systems, and aside from 
one server, there have only been two crashes like this in the past two 
months.  (that "one server" though has crashed about four or five 
times, which is why I suspected hardware at first)

However I just had one happen on a Linode, which is pretty reliable 
hardware, so I figured I'd post and see what people think.  The end of 
the kernel log is attached.

I'm worried that something is corrupting kernel memory, and then 
causing crashes in un-related parts of the kernel.  I'm really not 
sure how to narrow it down other than turning off the nilfs snapshots 
and see if it continues to happen, though then i have to come up with 
another backup solution in the mean time.

It's worth noting that the server crashing frequently also has the 
largest tree of files.  Also, the nilfs filesystems on these systems 
date back to various versions of nilfs from the 3.* kernel line. It's 
possible that an old bug is lurking on the filesystem structure, but I 
don't believe nilfs has a check tool yet, correct?

Thanks,
Michael Conrad

[ 1907.431921] vlan1: port 3(tap-pcm0) entered forwarding state
[ 1978.778732] perf interrupt took too long (2509 > 2495), lowering kernel.perf_event_max_sample_rate to 50100
[ 3467.694580] NILFS warning: mounting unchecked fs
[ 3469.231826] NILFS: recovery complete.
[ 3469.235374] segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
[3464133.533497] perf interrupt took too long (5088 > 4960), lowering kernel.perf_event_max_sample_rate to 25200
[5185591.358403] ------------[ cut here ]------------
[5185591.358520] Kernel BUG at ffffffff802d1212 [verbose debug info unavailable]
[5185591.358613] invalid opcode: 0000 [#1] SMP 
[5185591.358760] Modules linked in: twofish_generic twofish_common camellia_generic serpent_generic blowfish_generic blowfish_common xcbc sha512_generic coretemp asix r8169 8139too natsemi tg3 bn
x2 3c59x tulip e100 e1000 e1000e vmxnet3
[5185591.359655] CPU: 1 PID: 949 Comm: kswapd0 Not tainted 4.1.22 #13
[5185591.359744] Hardware name: Dell Inc.                 OptiPlex 745                 /0MM599, BIOS 2.4.1  08/21/2007
[5185591.359884] task: ffff880126802d40 ti: ffff880121070000 task.ti: ffff880121070000
[5185591.360028] RIP: 0010:[<ffffffff802d1212>]  [<ffffffff802d1212>] shadow_lru_isolate+0x5a/0x146
[5185591.360209] RSP: 0000:ffff880121073c38  EFLAGS: 00210002
[5185591.360300] RAX: 0000000000000000 RBX: ffff880068708b60 RCX: 000000000000faf8
[5185591.360440] RDX: 0000000000000104 RSI: ffff880068708b60 RDI: ffff8801269bafc0
[5185591.360585] RBP: ffff880052a83bb8 R08: 0000000000000000 R09: ffff880121073d00
[5185591.360740] R10: fffffffffffffff9 R11: 0000000000018f00 R12: ffff8801269bafc0
[5185591.360879] R13: ffff880052a83bd0 R14: ffff8801269bafc8 R15: ffff88007ac10920
[5185591.361054] FS:  0000000000000000(0000) GS:ffff88012bc40000(0000) knlGS:0000000000000000
[5185591.361191] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[5185591.361287] CR2: 00000000f77d7000 CR3: 00000000a7672000 CR4: 00000000000006e0
[5185591.361418] Stack:
[5185591.361498]  ffff8801269bafc0 ffff880068708b60 ffff8801269bafc8 000000000000003d
[5185591.361533]  ffffffff802d11b8 ffffffff802d0c45 0000000000000000 ffff880121073d00
[5185591.361533]  00000000ffffffff 0000000000000082 0000000000000000 0000000000006383
[5185591.361533] Call Trace:
[5185591.361533]  [<ffffffff802d11b8>] ? memcg_drain_all_list_lrus+0x11c/0x11c
[5185591.361533]  [<ffffffff802d0c45>] ? __list_lru_walk_one.isra.5+0x8a/0x114
[5185591.361533]  [<ffffffff802d132a>] ? scan_shadow_nodes+0x2c/0x3b
[5185591.361533]  [<ffffffff802c443f>] ? shrink_slab+0x1b4/0x22a
[5185591.361533]  [<ffffffff802c6c95>] ? shrink_zone+0x168/0x1b7
[5185591.361533]  [<ffffffff802c76be>] ? kswapd+0x592/0x6fc
[5185591.361533]  [<ffffffff802c712c>] ? mem_cgroup_shrink_node_zone+0xa9/0xa9
[5185591.361533]  [<ffffffff8025afd1>] ? kthread+0xc5/0xcd
[5185591.361533]  [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5185591.361533]  [<ffffffff80d6f422>] ? ret_from_fork+0x42/0x70
[5185591.361533]  [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5185591.361533] Code: da a9 00 e9 e9 00 00 00 48 89 de 4c 89 f7 e8 3e f7 ff ff 4c 89 e7 e8 6c da a9 00 8b 53 ec 85 d2 75 02 0f 0b 31 c0 80 e2 7f 74 15 <0f> 0b 48 ff ca 48 89 55 58 48 83 c0 08 48
 3d 00 02 00 00 74 35 
[5185591.361533] RIP  [<ffffffff802d1212>] shadow_lru_isolate+0x5a/0x146
[5185591.361533]  RSP <ffff880121073c38>
[5185591.361533] ---[ end trace 7508b90725dc8b72 ]---
[5187529.914580] INFO: rcu_sched detected stalls on CPUs/tasks:
[5187529.914580]        1: (0 ticks this GP) idle=c99/140000000000000/0 softirq=695513032/695513032 fqs=8996 
[5187529.914580]        (detected by 2, t=18002 jiffies, g=281474919, c=281474918, q=511)
[5187529.914580] Task dump for CPU 1:
[5187529.914580] segctord        R  running task        0 23362      2 0x00000008
[5187529.914580]  ffffffff80d6ab5c 0000000000000002 ffff880044d2c000 ffff88012bc55e00
[5187529.914580]  7fffffffffffffff ffffffff8054415d 0000000000000002 ffff880024c33ef8
[5187529.914580]  ffffffff802b9a59 0000000000000052 0000000000244329 0000000000000007
[5187529.914580] Call Trace:
[5187529.914580]  [<ffffffff80d6ab5c>] ? __schedule+0x4a2/0x6ba
[5187529.914580]  [<ffffffff8054415d>] ? radix_tree_lookup_slot+0x10/0x24
[5187529.914580]  [<ffffffff802b9a59>] ? find_get_entry+0x15/0x63
[5187529.914580]  [<ffffffff802b9b6d>] ? pagecache_get_page+0x74/0x15c
[5187529.914580]  [<ffffffff80458a81>] ? nilfs_grab_buffer+0xa2/0xd9
[5187529.914580]  [<ffffffff8045961e>] ? nilfs_mdt_submit_block+0x114/0x134
[5187529.914580]  [<ffffffff80459660>] ? nilfs_mdt_read_block+0x22/0xd5
[5187529.914580]  [<ffffffff804597da>] ? nilfs_mdt_get_block+0x46/0x1d6
[5187529.914580]  [<ffffffff804686b5>] ? nilfs_palloc_get_block+0xa8/0xb9
[5187529.914580]  [<ffffffff80468784>] ? nilfs_palloc_get_bitmap_block+0x3d/0x44
[5187529.914580]  [<ffffffff804685ef>] ? nilfs_palloc_group_desc_add_entries+0x2d/0x4b
[5187529.914580]  [<ffffffff80468c50>] ? nilfs_palloc_prepare_alloc_entry+0x2b5/0x348
[5187529.914580]  [<ffffffff804686b5>] ? nilfs_palloc_get_block+0xa8/0xb9
[5187529.914580]  [<ffffffff8046888e>] ? nilfs_palloc_get_entry_block+0x65/0x6e
[5187529.914580]  [<ffffffff8045a344>] ? nilfs_btnode_prepare_change_key+0x88/0x14d
[5187529.914580]  [<ffffffff8045b230>] ? nilfs_btree_prepare_update_v+0xa0/0xc5
[5187529.914580]  [<ffffffff80462855>] ? nilfs_collect_dat_data+0x4a/0x4a
[5187529.914580]  [<ffffffff8045c870>] ? nilfs_btree_propagate+0x190/0x2c4
[5187529.914580]  [<ffffffff80462855>] ? nilfs_collect_dat_data+0x4a/0x4a
[5187529.914580]  [<ffffffff8045aa15>] ? nilfs_bmap_propagate+0x2b/0x51
[5187529.914580]  [<ffffffff80462872>] ? nilfs_collect_file_data+0x1d/0x4a
[5187529.914580]  [<ffffffff804618c4>] ? nilfs_segctor_apply_buffers+0x5a/0xc0
[5187529.914580]  [<ffffffff80462698>] ? nilfs_segctor_scan_file+0xb6/0x139
[5187529.914580]  [<ffffffff80462f1c>] ? nilfs_segctor_do_construct+0x5f9/0x1808
[5187529.914580]  [<ffffffff804642db>] ? nilfs_segctor_construct+0x8f/0x216
[5187529.914580]  [<ffffffff804645b8>] ? nilfs_segctor_thread+0x156/0x2fd
[5187529.914580]  [<ffffffff80464462>] ? nilfs_segctor_construct+0x216/0x216
[5187529.914580]  [<ffffffff8025afd1>] ? kthread+0xc5/0xcd
[5187529.914580]  [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5187529.914580]  [<ffffffff80d6f422>] ? ret_from_fork+0x42/0x70
[5187529.914580]  [<ffffffff8025af0c>] ? kthread_create_on_node+0x163/0x163
[5187571.161942] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1