About raid5 lock up

Guoqing Jiang <guoqing.jiang@xxxxxxxxxxxxxxx> · Thu, 21 Nov 2019 15:14:06 +0100

Hi All,

Recently, we got a report about raid5 lockup with 4.4.62 kernel as 
follows, but I failed to
reproduce it locally.

 21:40:57 >>> [4783749.306796] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 22
 21:40:58 >>> [4783749.743367] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 39
 21:40:59 >>> [4783750.324941] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 5
 21:41:00 >>> [4783751.739422] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 16
 21:41:00 >>> [4783752.115471] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 21
 21:41:01 >>> [4783752.355232] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 17
 21:41:01 >>> [4783752.627075] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 4
 21:41:01 >>> [4783752.775597] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 0
 21:41:01 >>> [4783752.941282] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 11
 21:41:04 >>> [4783755.955468] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 9
 21:41:04 >>> [4783756.142500] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 23
 21:41:04 >>> [4783756.211894] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 6
 21:41:04 >>> [4783756.264305] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 8
 21:41:07 >>> [4783759.284782] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 41
 21:42:17 >>> [4783828.834689] NMI watchdog: BUG: soft lockup - CPU#2 
stuck for 22s! [ppd:3164]
 21:42:45 >>> [4783856.835762] NMI watchdog: BUG: soft lockup - CPU#2 
stuck for 22s! [ppd:3164]
 21:43:13 >>> [4783884.837093] NMI watchdog: BUG: soft lockup - CPU#2 
stuck for 22s! [ppd:3164]
 21:43:41 >>> [4783912.838201] NMI watchdog: BUG: soft lockup - CPU#2 
stuck for 23s! [ppd:3164]
 21:44:02 >>> [4783933.941087] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 2
 22:04:02 >>> [4785134.037889] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 3
 22:05:37 >>> [4785228.764631] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 10
 22:07:03 >>> [4785315.326793] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 27
 22:08:40 >>> [4785411.462601] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 35

And most hard lockups have similar calltrace like this.

21:42:17 kernel: [4783749.306796] NMI watchdog: Watchdog detected hard 
LOCKUP on cpu 22
21:42:17 kernel: [4783749.306911] Modules linked in: dm_snapshot 
dm_bufio cpufreq_ondemand cpufreq_powersave cpufreq_stats 
cpufreq_userspace cpufreq_conservative ipmi_devintf rdma_ucm ib_ipoib 
ib_uverbs ib_umad x86_pkg_te
mp_thermal crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 
glue_helper ablk_helper cryptd efi_pstore efivars ipmi_si 
ipmi_msghandler acpi_power_meter hwmon acpi_cpu
freq acpi_pad button scst_vdisk yars(O) ib_srpt scst rdma_cm iw_cm ib_cm 
efivarfs raid456 libcrc32c async_raid6_recov async_memcpy async_pq 
async_xor xor async_tx raid6_pq mlx4_ib ib_sa ib_mad ib_core ib_addr 
ipv6 ib_netlink hid_generic u
sbhid sg crc32c_intel mlx4_core mlx_compat mpt3sas xhci_pci i2c_i801 
ahci i2c_core xhci_hcd libahci
 21:42:17 kernel: [4783749.310231] CPU: 22 PID: 33558 Comm: 
649175e1dc3_0 Tainted: G        W  O    4.4.62-1-storage #4.4.62-1.3
 21:42:17 kernel: [4783749.310233] Hardware name: Supermicro 
SSG-2029P-ACR24L/X11DPH-T, BIOS 3.1 05/22/2019
 21:42:17 kernel: [4783749.310235] task: ffff880bdab08000 ti: 
ffff880fba8e8000 task.ti: ffff880fba8e8000
 21:42:17 kernel: [4783749.310237] RIP: 0010:[<ffffffff81099fb1>]  
[<ffffffff81099fb1>] queued_spin_lock_slowpath+0xf1/0x160
 21:42:17 kernel: [4783749.310245] RSP: 0018:ffff880fba8eb738 EFLAGS: 
00000046
 21:42:17 kernel: [4783749.310246] RAX: 0000000000000000 RBX: 
ffff880c6d7b8800 RCX: ffff88086fc35d80
 21:42:17 kernel: [4783749.310247] RDX: ffff88107fc95d80 RSI: 
00000000005c0000 RDI: ffff880c6d7b8ad4
 21:42:17 kernel: [4783749.310248] RBP: ffff880fba8eb738 R08: 
0000000000000001 R09: 00000000633946d0
 21:42:17 kernel: [4783749.310249] R10: 0000000000000000 R11: 
0000000000000002 R12: ffff880c6d7b8810
 21:42:17 kernel: [4783749.310250] R13: 0000000000000000 R14: 
ffff880bf7fe8000 R15: ffff880bf7fe8000
 21:42:17 kernel: [4783749.310251] FS:  0000000000000000(0000) 
GS:ffff88107fc80000(0000) knlGS:0000000000000000
 21:42:17 kernel: [4783749.310253] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
 21:42:17 kernel: [4783749.310254] CR2: 0000558286fd0a08 CR3: 
0000000001a09000 CR4: 00000000003406e0
 21:42:17 kernel: [4783749.310255] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
 21:42:17 kernel: [4783749.310256] DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
 21:42:17 kernel: [4783749.310257] Stack:
 21:42:17 kernel: [4783749.310258]  ffff880fba8eb748 ffffffff81598060 
ffff880fba8eb800 ffffffffa0230b0a
 21:42:17 kernel: [4783749.310261]  ffff880fb74c1c00 0000000002011200 
ffff88046fc03700 ffff880fba8eb7b0
 21:42:17 kernel: [4783749.310263]  ffff880c6d7b8808 0000000000000002 
ffff880c6d7b89d0 0000000001265000
 21:42:17 kernel: [4783749.310265] Call Trace:
 21:42:17 kernel: [4783749.310272]  [<ffffffff81598060>] 
_raw_spin_lock+0x20/0x30
 21:42:17 kernel: [4783749.310275]  [<ffffffffa0230b0a>] 
raid5_get_active_stripe+0x1da/0x5250 [raid456]
 21:42:17 kernel: [4783749.310281]  [<ffffffff8112d165>] ? 
mempool_alloc_slab+0x15/0x20
 21:42:17 kernel: [4783749.310283]  [<ffffffffa0231174>] 
raid5_get_active_stripe+0x844/0x5250 [raid456]
 21:42:17 kernel: [4783749.310289]  [<ffffffff812d5574>] ? 
generic_make_request+0x24/0x2b0
 21:42:17 kernel: [4783749.310293]  [<ffffffff810938b0>] ? 
wait_woken+0x90/0x90
 21:42:17 kernel: [4783749.310298]  [<ffffffff814a2adc>] 
md_make_request+0xfc/0x250
 21:42:17 kernel: [4783749.310309]  [<ffffffff812d5867>] 
submit_bio+0x67/0x150

My understanding is that there could be two possible reasons for lockup:

1. There is deadlock inside raid5 code somewhere which should be fixed.
2. Since spin_lock/unlock_irq are called in raid5_get_active_stripe, 
then if the function need to
handle massive IOs, could it possible hard lockup was triggered due to 
IRQs are disabled more
than 10s? If so, maybe we need to touch nmi watchdog before disable irq.

Coly and Dennis had reported raid5 lock up issues with different kernel 
versions (4.2.8, 4.7.0-rc7
and 4.8-rc5), not sure it they are related or not, but I guess it is not 
fixed in latest code.

Any thoughts?

[1]. https://marc.info/?l=linux-raid&m=150348807422853&w=2
[2]. https://marc.info/?l=linux-raid&m=146883211430999&w=2

Thanks.
Guoqing