PROBLEM: double fault in md_end_io

Paweł Wiejacha <pawel.wiejacha@xxxxxxxxxxxx> · Fri, 9 Apr 2021 23:40:52 +0200

Hello,

Two of my machines constantly crash with a double fault like this:

1146  <0>[33685.629591] traps: PANIC: double fault, error_code: 0x0
1147  <4>[33685.629593] double fault: 0000 [#1] SMP NOPTI
1148  <4>[33685.629594] CPU: 10 PID: 2118287 Comm: kworker/10:0
Tainted: P           OE     5.11.8-051108-generic #202103200636
1149  <4>[33685.629595] Hardware name: ASUSTeK COMPUTER INC. KRPG-U8
Series/KRPG-U8 Series, BIOS 4201 09/25/2020
1150  <4>[33685.629595] Workqueue: xfs-conv/md12 xfs_end_io [xfs]
1151  <4>[33685.629596] RIP: 0010:__slab_free+0x23/0x340
1152  <4>[33685.629597] Code: 4c fe ff ff 0f 1f 00 0f 1f 44 00 00 55
48 89 e5 41 57 49 89 cf 41 56 49 89 fe 41 55 41 54 49 89 f4 53 48 83
e4 f0 48 83 ec 70 <48> 89 54 24 28 0f 1f 44 00 00 41 8b 46 28 4d 8b 6c
24 20 49 8b 5c
1153  <4>[33685.629598] RSP: 0018:ffffa9bc00848fa0 EFLAGS: 00010086
1154  <4>[33685.629599] RAX: ffff94c04d8b10a0 RBX: ffff94437a34a880
RCX: ffff94437a34a880
1155  <4>[33685.629599] RDX: ffff94437a34a880 RSI: ffffcec745e8d280
RDI: ffff944300043b00
1156  <4>[33685.629599] RBP: ffffa9bc00849040 R08: 0000000000000001
R09: ffffffff82a5d6de
1157  <4>[33685.629600] R10: 0000000000000001 R11: 000000009c109000
R12: ffffcec745e8d280
1158  <4>[33685.629600] R13: ffff944300043b00 R14: ffff944300043b00
R15: ffff94437a34a880
1159  <4>[33685.629601] FS:  0000000000000000(0000)
GS:ffff94c04d880000(0000) knlGS:0000000000000000
1160  <4>[33685.629601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
1161  <4>[33685.629602] CR2: ffffa9bc00848f98 CR3: 000000014d04e000
CR4: 0000000000350ee0
1162  <4>[33685.629602] Call Trace:
1163  <4>[33685.629603]  <IRQ>
1164  <4>[33685.629603]  ? kfree+0x3bc/0x3e0
1165  <4>[33685.629603]  ? mempool_kfree+0xe/0x10
1166  <4>[33685.629603]  ? mempool_kfree+0xe/0x10
1167  <4>[33685.629604]  ? mempool_free+0x2f/0x80
1168  <4>[33685.629604]  ? md_end_io+0x4a/0x70
1169  <4>[33685.629604]  ? bio_endio+0xdc/0x130
1170  <4>[33685.629605]  ? bio_chain_endio+0x2d/0x40
1171  <4>[33685.629605]  ? md_end_io+0x5c/0x70
1172  <4>[33685.629605]  ? bio_endio+0xdc/0x130
1173  <4>[33685.629605]  ? bio_chain_endio+0x2d/0x40
1174  <4>[33685.629606]  ? md_end_io+0x5c/0x70
1175  <4>[33685.629606]  ? bio_endio+0xdc/0x130
1176  <4>[33685.629606]  ? bio_chain_endio+0x2d/0x40
1177  <4>[33685.629607]  ? md_end_io+0x5c/0x70
... repeated ...
1436  <4>[33685.629677]  ? bio_endio+0xdc/0x130
1437  <4>[33685.629677]  ? bio_chain_endio+0x2d/0x40
1438  <4>[33685.629677]  ? md_end_io+0x5c/0x70
1439  <4>[33685.629677]  ? bio_endio+0xdc/0x130
1440  <4>[33685.629678]  ? bio_chain_endio+0x2d/0x40
1441  <4>[33685.629678]  ? md_
1442  <4>[33685.629679] Lost 357 message(s)!

This happens on:
5.11.8-051108-generic #202103200636 SMP Sat Mar 20 11:17:32 UTC 2021
and on 5.8.0-44-generic #50~20.04.1-Ubuntu
(https://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_5.8.0-44.50/changelog)
which contains backported
https://github.com/torvalds/linux/commit/41d2d848e5c09209bdb57ff9c0ca34075e22783d
("md: improve io stats accounting").
The 5.8.18-050818-generic #202011011237 SMP Sun Nov 1 12:40:15 UTC
2020 which does not contain above suspected change does not crash.

If there's a better way/place to report this bug just let me know. If
not, here are steps to reproduce:

1. Create a RAID 0 device using three Micron_9300_MTFDHAL7T6TDP disks.
mdadm --create --verbose /dev/md12 --level=stripe --raid-devices=3
/dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1

2. Setup xfs on it:
mkfs.xfs /dev/md12 and mount it

3. Write to a file on this filesystem:
while true; do rm -rf /mnt/md12/crash* ; for i in `seq 8`; do dd
if=/dev/zero of=/mnt/md12/crash$i bs=32K count=50000000 & done; wait;
done
Wait for a crash (usually less than 20 min).

I couldn't reproduce it with a single dd process (maybe I have to wait
a little longer), but a single cat
/very/large/file/on/cephfs/over100GbE > /mnt/md12/crash is enough for
this double fault to occur.

More info:
This long mempool_kfree - md_end_io - *  -md_end_io stack trace looks
always the same, but the panic occurs in different places:

pstore/6948115143318/dmesg.txt-<4>[545649.087998] CPU: 88 PID: 0 Comm:
swapper/88 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6948377398316/dmesg.txt-<4>[11275.914909] CPU: 14 PID: 0 Comm:
swapper/14 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6948532816002/dmesg.txt-<4>[33685.629594] CPU: 10 PID: 2118287
Comm: kworker/10:0 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6948532816002/dmesg.txt-<4>[33685.629595] Workqueue:
xfs-conv/md12 xfs_end_io [xfs]
pstore/6948855849083/dmesg.txt-<4>[42934.321129] CPU: 85 PID: 0 Comm:
swapper/85 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6948876331782/dmesg.txt-<4>[ 3475.020672] CPU: 86 PID: 0 Comm:
swapper/86 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6949083860307/dmesg.txt-<4>[43048.254375] CPU: 45 PID: 0 Comm:
swapper/45 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6949091775931/dmesg.txt-<4>[ 1150.790240] CPU: 64 PID: 0 Comm:
swapper/64 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6949123356826/dmesg.txt-<4>[ 6963.858253] CPU: 6 PID: 51 Comm:
kworker/6:0 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6949123356826/dmesg.txt-<4>[ 6963.858255] Workqueue: ceph-msgr
ceph_con_workfn [libceph]
pstore/6949123356826/dmesg.txt-<4>[ 6963.858253] CPU: 6 PID: 51 Comm:
kworker/6:0 Tainted: P           OE     5.11.8-051108-generic
#202103200636
pstore/6949123356826/dmesg.txt-<4>[ 6963.858255] Workqueue: ceph-msgr
ceph_con_workfn [libceph]
pstore/6949152322085/dmesg.txt-<4>[ 6437.077874] CPU: 59 PID: 0 Comm:
swapper/59 Tainted: P           OE     5.11.8-051108-generic
#202103200636

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.11.8-051108-generic
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro net.ifnames=0 biosdevname=0
strict-devmem=0 mitigations=off iommu=pt

cat /proc/cpuinfo
model name      : AMD EPYC 7552 48-Core Processor

cat /proc/mounts
/dev/md12 /mnt/ssd1 xfs
rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=3072,prjquota
0 0

Let me know if you need more information.

Best regards,
Paweł Wiejacha