Hi All,
Recently, we got a report about raid5 lockup with 4.4.62 kernel as
follows, but I failed to
reproduce it locally.
21:40:57 >>> [4783749.306796] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 22
21:40:58 >>> [4783749.743367] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 39
21:40:59 >>> [4783750.324941] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 5
21:41:00 >>> [4783751.739422] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 16
21:41:00 >>> [4783752.115471] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 21
21:41:01 >>> [4783752.355232] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 17
21:41:01 >>> [4783752.627075] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 4
21:41:01 >>> [4783752.775597] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 0
21:41:01 >>> [4783752.941282] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 11
21:41:04 >>> [4783755.955468] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 9
21:41:04 >>> [4783756.142500] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 23
21:41:04 >>> [4783756.211894] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 6
21:41:04 >>> [4783756.264305] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 8
21:41:07 >>> [4783759.284782] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 41
21:42:17 >>> [4783828.834689] NMI watchdog: BUG: soft lockup - CPU#2
stuck for 22s! [ppd:3164]
21:42:45 >>> [4783856.835762] NMI watchdog: BUG: soft lockup - CPU#2
stuck for 22s! [ppd:3164]
21:43:13 >>> [4783884.837093] NMI watchdog: BUG: soft lockup - CPU#2
stuck for 22s! [ppd:3164]
21:43:41 >>> [4783912.838201] NMI watchdog: BUG: soft lockup - CPU#2
stuck for 23s! [ppd:3164]
21:44:02 >>> [4783933.941087] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 2
22:04:02 >>> [4785134.037889] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 3
22:05:37 >>> [4785228.764631] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 10
22:07:03 >>> [4785315.326793] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 27
22:08:40 >>> [4785411.462601] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 35
And most hard lockups have similar calltrace like this.
21:42:17 kernel: [4783749.306796] NMI watchdog: Watchdog detected hard
LOCKUP on cpu 22
21:42:17 kernel: [4783749.306911] Modules linked in: dm_snapshot
dm_bufio cpufreq_ondemand cpufreq_powersave cpufreq_stats
cpufreq_userspace cpufreq_conservative ipmi_devintf rdma_ucm ib_ipoib
ib_uverbs ib_umad x86_pkg_te
mp_thermal crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64
glue_helper ablk_helper cryptd efi_pstore efivars ipmi_si
ipmi_msghandler acpi_power_meter hwmon acpi_cpu
freq acpi_pad button scst_vdisk yars(O) ib_srpt scst rdma_cm iw_cm ib_cm
efivarfs raid456 libcrc32c async_raid6_recov async_memcpy async_pq
async_xor xor async_tx raid6_pq mlx4_ib ib_sa ib_mad ib_core ib_addr
ipv6 ib_netlink hid_generic u
sbhid sg crc32c_intel mlx4_core mlx_compat mpt3sas xhci_pci i2c_i801
ahci i2c_core xhci_hcd libahci
21:42:17 kernel: [4783749.310231] CPU: 22 PID: 33558 Comm:
649175e1dc3_0 Tainted: G W O 4.4.62-1-storage #4.4.62-1.3
21:42:17 kernel: [4783749.310233] Hardware name: Supermicro
SSG-2029P-ACR24L/X11DPH-T, BIOS 3.1 05/22/2019
21:42:17 kernel: [4783749.310235] task: ffff880bdab08000 ti:
ffff880fba8e8000 task.ti: ffff880fba8e8000
21:42:17 kernel: [4783749.310237] RIP: 0010:[<ffffffff81099fb1>]
[<ffffffff81099fb1>] queued_spin_lock_slowpath+0xf1/0x160
21:42:17 kernel: [4783749.310245] RSP: 0018:ffff880fba8eb738 EFLAGS:
00000046
21:42:17 kernel: [4783749.310246] RAX: 0000000000000000 RBX:
ffff880c6d7b8800 RCX: ffff88086fc35d80
21:42:17 kernel: [4783749.310247] RDX: ffff88107fc95d80 RSI:
00000000005c0000 RDI: ffff880c6d7b8ad4
21:42:17 kernel: [4783749.310248] RBP: ffff880fba8eb738 R08:
0000000000000001 R09: 00000000633946d0
21:42:17 kernel: [4783749.310249] R10: 0000000000000000 R11:
0000000000000002 R12: ffff880c6d7b8810
21:42:17 kernel: [4783749.310250] R13: 0000000000000000 R14:
ffff880bf7fe8000 R15: ffff880bf7fe8000
21:42:17 kernel: [4783749.310251] FS: 0000000000000000(0000)
GS:ffff88107fc80000(0000) knlGS:0000000000000000
21:42:17 kernel: [4783749.310253] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
21:42:17 kernel: [4783749.310254] CR2: 0000558286fd0a08 CR3:
0000000001a09000 CR4: 00000000003406e0
21:42:17 kernel: [4783749.310255] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
21:42:17 kernel: [4783749.310256] DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
21:42:17 kernel: [4783749.310257] Stack:
21:42:17 kernel: [4783749.310258] ffff880fba8eb748 ffffffff81598060
ffff880fba8eb800 ffffffffa0230b0a
21:42:17 kernel: [4783749.310261] ffff880fb74c1c00 0000000002011200
ffff88046fc03700 ffff880fba8eb7b0
21:42:17 kernel: [4783749.310263] ffff880c6d7b8808 0000000000000002
ffff880c6d7b89d0 0000000001265000
21:42:17 kernel: [4783749.310265] Call Trace:
21:42:17 kernel: [4783749.310272] [<ffffffff81598060>]
_raw_spin_lock+0x20/0x30
21:42:17 kernel: [4783749.310275] [<ffffffffa0230b0a>]
raid5_get_active_stripe+0x1da/0x5250 [raid456]
21:42:17 kernel: [4783749.310281] [<ffffffff8112d165>] ?
mempool_alloc_slab+0x15/0x20
21:42:17 kernel: [4783749.310283] [<ffffffffa0231174>]
raid5_get_active_stripe+0x844/0x5250 [raid456]
21:42:17 kernel: [4783749.310289] [<ffffffff812d5574>] ?
generic_make_request+0x24/0x2b0
21:42:17 kernel: [4783749.310293] [<ffffffff810938b0>] ?
wait_woken+0x90/0x90
21:42:17 kernel: [4783749.310298] [<ffffffff814a2adc>]
md_make_request+0xfc/0x250
21:42:17 kernel: [4783749.310309] [<ffffffff812d5867>]
submit_bio+0x67/0x150
My understanding is that there could be two possible reasons for lockup:
1. There is deadlock inside raid5 code somewhere which should be fixed.
2. Since spin_lock/unlock_irq are called in raid5_get_active_stripe,
then if the function need to
handle massive IOs, could it possible hard lockup was triggered due to
IRQs are disabled more
than 10s? If so, maybe we need to touch nmi watchdog before disable irq.
Coly and Dennis had reported raid5 lock up issues with different kernel
versions (4.2.8, 4.7.0-rc7
and 4.8-rc5), not sure it they are related or not, but I guess it is not
fixed in latest code.
Any thoughts?
[1]. https://marc.info/?l=linux-raid&m=150348807422853&w=2
[2]. https://marc.info/?l=linux-raid&m=146883211430999&w=2
Thanks.
Guoqing