i_wb_list corruption and possible race

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I'm trying to understand a problem we've encountered that causes all
disk IO to block indefinitely, requiring a reboot. There have been
about 10 occurrences over the past 6 months among several dozen
systems. They are all dual-socket Xeon E5 v3 (Haswell) boards running
kernel 3.14.51 as dom0 on Xen 4.5.1.

The kernel has grsecurity-3.1-3.14.51-201508181951.patch, some
backported bug fixes mainly in Xen support, and a not-yet-upstreamed
driver for a custom bit of hardware. We haven't had any luck
reproducing the issue on a more recent stock kernel, though this is
not too surprising given the intermittent nature.

The main symptom in all cases is a message like "list_del corruption.
prev->next should be ffff8804a248d018, but was 0000000114f2a8bf",
along with a never-ending series of soft lockup messages. The
corruption is always an inode i_wb_list, and always detected on
list_del. Usually it happens in __mark_inode_dirty, invoked from a vfs
syscall like chown or unlink. The corrupt prev->next is usually some
value from the stack. In about half the cases, the lower 32 bits are a
jiffies value that happens to be 315 seconds earlier than than the
time the message was printed.

With CONFIG_LIST_DEBUG=y, when list_del detects the corruption it
leaves the list untouched. And the grsecurity version of list_del
calls BUG rather than WARN, so the calling process dies, holding a wb
list_lock. The latter behavior is responsible for at least some of the
subsequent lockups, as other processes spin forever waiting for the
list_lock.

Another lockup situation occurs in almost all cases, where a writeback
worker loops forever in move_expired_inodes. This does not appear to
be a side effect of the aborted list_del, but rather a result of the
original list corruption. This looping also occurs while a wb
list_lock is held, possibly causing additional lockups later on.

The log below shows one of these cases. The first sign of trouble is a
writeback worker (pid 1989) complaining about a lockup for 22 sec. It
has already been doing a lot of work in move_expired_inodes, as the
value of moved (in r12) is 0x9ec8f79a. The tmp.next and tmp.prev
pointers on the stack are both 0xffff8804a248d018. The wb pointer (in
rbx) is 0xffff8804bd6691e0. This worker's block device (251:1) is a
dm-crypt volume backed by a file in an ext4 fs on another block device
(a raid1 volume on a pair of SATA SSDs).

The next trace shows pid 1364 spinning trying to grab a wb list_lock.
This is presumably the one held by pid 1989, but I haven't been able
to confirm this based on stack or registers.

The third trace shows pid 855 detecting the list_del corruption, from
__mark_inode_dirty in a chown syscall. The list entry being moved is
0xffff8804a248d018 (seen both in the corruption message and in rbx).
So the inode here is the same one seen in the tmp list in
move_expired_inodes.

[1406104.150779] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u8:2:1989]
[1406104.150892] Modules linked in: xt_physdev(F) xen_netback(F)
xen_blkback(F) ip_gre(F) ip_tunnel(F) ixgbevf(F) dm_crypt(F)
algif_skcipher(F) af_alg(F) openvswitch(F) vxlan(F) gre(F)
libcrc32c(F) xen_pciback(F) nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F)
lockd(F) fscache(F) sunrpc(F) xt_LOG(F) xt_limit(F) xt_tcpudp(F)
nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F)
iptable_filter(F) ip_tables(F) x_tables(F) iTCO_wdt(F)
iTCO_vendor_support(F) gpio_ich(F) raid1(F) x86_pkg_temp_thermal(F)
intel_powerclamp(F) coretemp(F) crc32_pclmul(F) ghash_clmulni_intel(F)
aesni_intel(F) aes_x86_64(F) lrw(F) gf128mul(F) glue_helper(F)
ablk_helper(F) cryptd(F) sb_edac(F) mei_me(F) mei(F) lpc_ich(F)
mfd_core(F) i2c_i801(F) ioatdma(F) squashfs(F) ixgbe(F) mdio(F) ptp(F)
pps_core(F) dca(F) ahci(F) libahci(F) tpm_tis(F)
[1406104.151735] irq event stamp: 188840
[1406104.151781] hardirqs last  enabled at (188839):
[<ffffffff81680dad>] ffffffff81680dad
[1406104.151867] hardirqs last disabled at (188840):
[<ffffffff81681303>] ffffffff81681303
[1406104.151954] softirqs last  enabled at (188838):
[<ffffffff810716d4>] ffffffff810716d4
[1406104.152040] softirqs last disabled at (188833):
[<ffffffff81071add>] ffffffff81071add
[1406104.152131] CPU: 0 PID: 1989 Comm: kworker/u8:2 Tainted: GF
     3.14.51-grsec-delta #1
[1406104.152229] Hardware name: Skyport Systems CD/S2600WTT, BIOS
SE5C610.86B.01.01.0009.C1.060120151350 06/01/2015
[1406104.152334] Workqueue: writeback ffffffff8121e4c0 (flush-252:1)
[1406104.152400] task: ffff880492bcc4a0 ti: ffff880492bccba8 task.ti:
ffff880492bccba8
[1406104.152494] RIP: e030:[<ffffffff8121b090>]  [<ffffffff8121b090>]
queue_io+0xd0/0x250
[1406104.152585] RSP: e02b:ffff8803c46cfbf0  EFLAGS: 00000202
[1406104.152635] RAX: ffff8803c46cfc88 RBX: ffff8804bd6691e0 RCX:
0000000000000000  // RBX == &wb
[1406104.152743] RDX: ffff8803c46cfc10 RSI: ffff8803c46cfc10 RDI:
0000000000000000
[1406104.152833] RBP: ffff8803c46cfc48 R08: 0000000000000001 R09:
0000000000000000
[1406104.152915] R10: 0000000000000001 R11: 0000000000000001 R12:
000000009ec8f79a  // R12 == moved
[1406104.152997] R13: ffff8803c46cfc10 R14: 0000000000000000 R15:
ffff8804a248d018
[1406104.153083] FS:  0000000000000000(0000) GS:ffff8804dc800000(0000)
knlGS:ffff8804dc800000
[1406104.153176] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[1406104.153227] CR2: 000000c208631000 CR3: 00000004c03fb000 CR4:
0000000000042660
[1406104.153311] Stack:
[1406104.153350]  ffff8804bd6692e8 00000000bd669320 ffff8804bd6692d8
ffff8803c46cfcf8
[1406104.153447]  ffff8804a248d018 ffff8804a248d018 ffff8803c46cfcf8
0000000114f270f4  // tmp.next, tmp.prev, _, do_sb_sort|_
[1406104.153543]  ffff8804bd6691e0 ffff8804bd6691f8 ffff8804bd6691f8
ffff8803c46cfcb8
[1406104.153640] Call Trace:
[1406104.153685]  [<ffffffff8121b434>] wb_writeback+0x224/0x2c0
[1406104.153748]  [<ffffffff8121e764>] bdi_writeback_workfn+0x2a4/0x430
[1406104.153815]  [<ffffffff8108e91b>] process_one_work+0x21b/0x530
[1406104.153877]  [<ffffffff8108e862>] ? process_one_work+0x162/0x530
[1406104.153938]  [<ffffffff8108ed5d>] worker_thread+0x12d/0x3e0
[1406104.153990]  [<ffffffff8108ec30>] ? process_one_work+0x530/0x530
[1406104.154047]  [<ffffffff81096357>] kthread+0xe7/0x100
[1406104.154106]  [<ffffffff81096270>] ? kthread_create_on_node+0x220/0x220
[1406104.154176]  [<ffffffff8168a589>] ret_from_fork+0x49/0x80
[1406104.154235]  [<ffffffff81096270>] ? kthread_create_on_node+0x220/0x220
[1406104.154312] Code: 1f 40 00 48 8b 45 b8 48 3b 83 f8 00 00 00 74 7b
48 8b 45 c0 4c 8b bb 00 01 00 00 48 8b 40 10 48 85 c0 74 09 49 8b 4f
e8 48 39 08 <78> 5e 4c 89 ff 41 83 c4 01 e8 e2 9b 17 00 48 8b 55 c8 4c
89 ee
[1406104.158778] BUG: soft lockup - CPU#3 stuck for 22s! [charlie:1364]
[1406104.166935] Modules linked in: xt_physdev(F) xen_netback(F)
xen_blkback(F) ip_gre(F) ip_tunnel(F) ixgbevf(F) dm_crypt(F)
algif_skcipher(F) af_alg(F) openvswitch(F) vxlan(F) gre(F)
libcrc32c(F) xen_pciback(F) nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F)
lockd(F) fscache(F) sunrpc(F) xt_LOG(F) xt_limit(F) xt_tcpudp(F)
nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F)
iptable_filter(F) ip_tables(F) x_tables(F) iTCO_wdt(F)
iTCO_vendor_support(F) gpio_ich(F) raid1(F) x86_pkg_temp_thermal(F)
intel_powerclamp(F) coretemp(F) crc32_pclmul(F) ghash_clmulni_intel(F)
aesni_intel(F) aes_x86_64(F) lrw(F) gf128mul(F) glue_helper(F)
ablk_helper(F) cryptd(F) sb_edac(F) me
i_me(F) mei(F) lpc_ich(F) mfd_core(F) i2c_i801(F) ioatdma(F)
squashfs(F) ixgbe(F) mdio(F) ptp(F) pps_core(F) dca(F) ahci(F)
libahci(F) tpm_tis(F)
[1406104.167696] irq event stamp: 449466544
[1406104.167741] hardirqs last  enabled at (449466543):
[<ffffffff81680dad>] ffffffff81680dad
[1406104.167830] hardirqs last disabled at (449466544):
[<ffffffff81681303>] ffffffff81681303
[1406104.167918] softirqs last  enabled at (449466542):
[<ffffffff810716d4>] ffffffff810716d4
[1406104.168007] softirqs last disabled at (449466529):
[<ffffffff81071add>] ffffffff81071add
[1406104.168097] CPU: 3 PID: 1364 Comm: charlie Tainted: GF
3.14.51-grsec-delta #1
[1406104.168189] Hardware name: Skyport Systems CD/S2600WTT, BIOS
SE5C610.86B.01.01.0009.C1.060120151350 06/01/2015
[1406104.168284] task: ffff8804ae1c44a0 ti: ffff8804ae1c4ba8 task.ti:
ffff8804ae1c4ba8
[1406104.168367] RIP: e030:[<ffffffff8101de86>]  [<ffffffff8101de86>]
native_read_tsc+0x6/0x20
[1406104.168466] RSP: e02b:ffff8804c5933d18  EFLAGS: 00000282
[1406104.168545] RAX: 0000000034c7545c RBX: 0000000000000001 RCX:
0000000000005358
[1406104.168635] RDX: 00000000000bf6ae RSI: 00000000000007e4 RDI:
0000000000000001
[1406104.168718] RBP: ffff8804c5933d18 R08: 0000000000000001 R09:
0000000000636000
[1406104.168801] R10: 00000000000001f9 R11: 0000000000000000 R12:
0000000034c7543e
[1406104.168884] R13: 0000000000000003 R14: ffffffff817e6f20 R15:
ffff8804c5933e70
[1406104.168980] FS:  0000694137550700(0000) GS:ffff8804dce00000(0000)
knlGS:0000000000000000
[1406104.169074] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[1406104.169125] CR2: 000000c208e8f000 CR3: 00000004c03fb000 CR4:
0000000000042660
[1406104.169208] Stack:
[1406104.169246]  ffff8804c5933d40 ffffffff81387098 ffff8804bd669308
000000008eb9b070
[1406104.169344]  0000000038dcbc69 ffff8804c5933d50 ffffffff81386f8f
ffff8804c5933d78
[1406104.169441]  ffffffff810ce29e ffff8804bd669320 ffff8804bd669308
ffff8804a1d50338
[1406104.169539] Call Trace:
[1406104.169584]  [<ffffffff81387098>] delay_tsc+0x48/0x90
[1406104.169642]  [<ffffffff81386f8f>] __delay+0xf/0x20
[1406104.169697]  [<ffffffff810ce29e>] do_raw_spin_lock+0xde/0x1d0
[1406104.169767]  [<ffffffff8167fb66>] _raw_spin_lock+0x56/0x70
[1406104.169829]  [<ffffffff8121e43f>] ? inode_wb_list_del+0x3f/0x80
[1406104.169890]  [<ffffffff8121e43f>] inode_wb_list_del+0x3f/0x80
[1406104.169953]  [<ffffffff8120ae4f>] evict+0x4f/0x1b0
[1406104.170009]  [<ffffffff8120b10a>] iput+0x10a/0x1c0
[1406104.170069]  [<ffffffff811fad2a>] do_unlinkat+0x1aa/0x3a0
[1406104.170134]  [<ffffffff81388522>] ? trace_hardirqs_on_thunk+0x41/0x46
[1406104.170198]  [<ffffffff811fee06>] SyS_unlink+0x16/0x20
[1406104.170256]  [<ffffffff8168a649>] system_call_fastpath+0x1a/0x1f
[1406104.170315] Code: 00 00 00 00 00 00 41 5c 41 5d 5d c3 66 0f 1f 84
00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 5d c3 0f 1f 44 00 00 55 48
89 e5 0f 31 <48> c1 e2 20 89 c0 48 09 d0 5d c3 0f 1f 44 00 00 66 2e 0f
1f 84
[1406104.693036] list_del corruption. prev->next should be
ffff8804a248d018, but was 0000000114f2a8bf
[1406104.693168] invalid opcode: 0000 [#1] SMP
[1406104.693240] Modules linked in: xt_physdev(F) xen_netback(F)
xen_blkback(F) ip_gre(F) ip_tunnel(F) ixgbevf(F) dm_crypt(F)
algif_skcipher(F) af_alg(F) openvswitch(F) vxlan(F) gre(F)
libcrc32c(F) xen_pciback(F) nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F)
lockd(F) fscache(F) sunrpc(F) xt_LOG(F) xt_limit(F) xt_tcpudp(F)
nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F)
iptable_filter(F) ip_tables(F) x_tables(F) iTCO_wdt(F)
iTCO_vendor_support(F) gpio_ich(F) raid1(F) x86_pkg_temp_thermal(F)
intel_powerclamp(F) coretemp(F) crc32_pclmul(F) ghash_clmulni_intel(F)
aesni_intel(F) aes_x86_64(F) lrw(F) gf128mul(F) glue_helper(F)
ablk_helper(F) cryptd(F) sb_edac(F) mei_me(F) mei(F) lpc_ich(F)
mfd_core(F) i2c_i801(F) ioatdma(F) squashfs(F) ixgbe(F) mdio(F) ptp(F)
pps_core(F) dca(F) ahci(F) libahci(F) tpm_tis(F)
[1406104.694176] CPU: 2 PID: 855 Comm: sierra Tainted: GF
3.14.51-grsec-delta #1
[1406104.694278] Hardware name: Skyport Systems CD/S2600WTT, BIOS
SE5C610.86B.01.01.0009.C1.060120151350 06/01/2015
[1406104.694391] task: ffff8804c824adc0 ti: ffff8804c824b4c8 task.ti:
ffff8804c824b4c8
[1406104.694498] RIP: e030:[<ffffffff81394c59>]  [<ffffffff81394c59>]
__list_del_entry_debug+0x89/0xb0
[1406104.694619] RSP: e02b:ffff8800109cfd70  EFLAGS: 00010282
[1406104.694683] RAX: 0000000000000054 RBX: ffff8804a248d018 RCX:
0000000000000006   // RBX == &inode->i_wb_list
[1406104.694782] RDX: 0000000000001fa0 RSI: ffff8804c824b760 RDI:
0000000000000200
[1406104.694880] RBP: ffff8800109cfd70 R08: 0000000000000006 R09:
0000000000000000
[1406104.694977] R10: 0000000000000001 R11: 0000000000000000 R12:
ffff8804a248ce70   // R12 == &inode
[1406104.695078] R13: ffff8804bd579308 R14: ffff8804bd578f00 R15:
ffff8804bd5792d8   // R14 == &bdi
[1406104.695183] FS:  0000634db5d82700(0000) GS:ffff8804dcc00000(0000)
knlGS:ffff8804dcc00000
[1406104.695285] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[1406104.695353] CR2: 000000c209075000 CR3: 00000004c2e01000 CR4:
0000000000042660
[1406104.695451] Stack:
[1406104.695501]  ffff8800109cfd88 ffffffff81394c8d ffff8804a248d018
ffff8800109cfdc0
[1406104.695622]  ffffffff8121ae26 ffff8804a248ce70 ffff8800109cfe88
0000000000000000
[1406104.695753]  0000000000001846 0000000000000000 ffff8800109cfe20
ffffffff81283958
[1406104.695872] Call Trace:
[1406104.695935]  [<ffffffff81394c8d>] __list_del_entry+0xd/0x30
[1406104.696000]  [<ffffffff8121ae26>] __mark_inode_dirty+0x156/0x2f0
[1406104.696076]  [<ffffffff81283958>] ext4_setattr+0x2a8/0x720
[1406104.696152]  [<ffffffff81314da2>] ? selinux_inode_setattr+0xc2/0xd0
[1406104.696230]  [<ffffffff8132dca1>] ? evm_inode_setattr+0x21/0x80
[1406104.696302]  [<ffffffff8120c779>] notify_change+0x249/0x360
[1406104.696380]  [<ffffffff811e5551>] chown_common+0x181/0x1d0
[1406104.696452]  [<ffffffff811e73d1>] SyS_fchown+0xa1/0xb0
[1406104.696518]  [<ffffffff8168a649>] system_call_fastpath+0x1a/0x1f
[1406104.696584] Code: 00 00 48 89 fe 31 c0 48 c7 c7 18 42 9f 81 e8 17
08 2e 00 0f 0b 0f 1f 44 00 00 48 89 fe 31 c0 48 c7 c7 50 42 9f 81 e8
ff 07 2e 00 <0f> 0b 0f 1f 44 00 00 48 89 fe 31 c0 48 c7 c7 90 42 9f 81
e8 e7
[1406104.697209] RIP  [<ffffffff81394c59>] __list_del_entry_debug+0x89/0xb0
[1406104.697285]  RSP <ffff8800109cfd70>
[1406104.697798] ---[ end trace fdfa4fe160b516af ]---

The remaining traces from this log
(https://gist.github.com/eswierk/716e73c30f55c7962584c73ad7f4f82c)
show more of the same: either someone spinning trying to grab a wb
list_lock, or the same writeback worker (pid 1989) looping in
move_expired_inodes.

The worker never gets beyond the first loop, trying to move entries
from delaying_queue to the tmp list. The moved count keeps increasing,
while the tmp.next and tmp.prev pointers never change. This suggests
that the i_wb_list pointers are messed up in such a way that the
list_del and list_add steps of the list_move add up to a no-op, and
the delaying_queue never empties.

Assuming the wb list_lock prevents one inode's i_wb_list from being
manipulated concurrently from two processes, the above situation seems
impossible: if pid 1989 grabs the lock first, and dies holding the
lock, pid 855 can't grab it; if pid 855 grabs the lock first, and
loops forever holding the lock, pid 1989 can't grab it.

One explanation is that the two processes are not holding the same
lock. The first trace shows in pid 1989, &wb (in rbx) is
0xffff8804bd6691e0. The third trace shows in pid 855, &bdi (in r14) is
0xffff8804bd578f00. But the relationships among inode, bdi, and wb are
unclear to me, and I don't know whether some invariant is being
violated or whether this is the root cause or a symptom.

Any clues about what could have caused the original i_wb_list corruption?

I can also provide logs from other instances of this issue, as well as
kernel config, more details on our patches, and anything else that
would help.

--Ed
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux