Hi Leon, hi Saeed,

We have seen crashes during server shutdown on both kernel 5.10 and kernel 5.15, with a GPF in the mlx5 mlx5_cmd_comp_handler() function. All of the crashes point to line 1606:

    memcpy(ent->out->first.data, ent->lay->out, sizeof(ent->lay->out));

I guess it's some kind of use-after-free of the ent buffer (a toy sketch of what I suspect is in the P.S. below). I tried to reproduce it by repeatedly rebooting the test servers, but no success so far.

A sample output from kernel 5.15.32:

<30>[ 1246.308327] systemd-shutdown[1]: Rebooting.
<6>[ 1246.308429] kvm: exiting hardware virtualization
<6>[ 1246.602813] megaraid_sas 0000:65:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
<6>[ 1246.605901] mlx5_core 0000:4b:00.1: Shutdown was called
<6>[ 1246.608371] mlx5_core 0000:4b:00.0: Shutdown was called
<4>[ 1246.608811] general protection fault, probably for non-canonical address 0xb028ffa964bb3e4b: 0000 [#1] SMP
<4>[ 1246.615211] CPU: 95 PID: 5670 Comm: kworker/u256:6 Tainted: G O 5.15.32-pserver #5.15.32-1+feature+linux+5.15.y+20220405.0441+03895bda~deb11
<4>[ 1246.628483] Hardware name: Dell Inc. PowerEdge R650/0PYXKY, BIOS 1.5.5 02/10/2022
<4>[ 1246.635401] Workqueue: mlx5_cmd_0000:4b:00.0 cmd_work_handler [mlx5_core]
<4>[ 1246.642459] RIP: 0010:mlx5_cmd_comp_handler+0xda/0x490 [mlx5_core]
<4>[ 1246.649707] Code: b0 00 00 00 01 0f 84 9a 02 00 00 4c 89 ff e8 9d e5 ff ff e8 28 86 34 db 49 8b 97 00 01 00 00 49 89 87 20 01 00 00 49 8b 47 10 <48> 8b 72 20 48 8b 7a 28 31 d2 48 89 70 1c 4c 89 fe 48 89 78 24 48
<4>[ 1246.664596] RSP: 0018:ff59e28ca3103db0 EFLAGS: 00010202
<4>[ 1246.672167] RAX: ff25be460afc4580 RBX: ff25be460d196180 RCX: 0000000000000017
<4>[ 1246.679804] RDX: b028ffa964bb3e2b RSI: 000000000003a550 RDI: 000ce7c15700ae52
<4>[ 1246.687528] RBP: ff59e28ca3103e28 R08: 0000000000000001 R09: ffffffffc0dc5500
<4>[ 1246.695331] R10: ff25be4607793000 R11: ff25be4607793000 R12: 0000000000000000
<4>[ 1246.703167] R13: ff25be460d196180 R14: ff25be460d1962a8 R15: ff25be4607793000
<4>[ 1246.711051] FS: 0000000000000000(0000) GS:ff25bec4019c0000(0000) knlGS:0000000000000000
<4>[ 1246.719000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 1246.726844] CR2: 00007f82d8850006 CR3: 000000695760a004 CR4: 0000000000771ee0
<4>[ 1246.734856] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 1246.742757] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 1246.750573] PKRU: 55555554
<4>[ 1246.758289] Call Trace:
<4>[ 1246.765755]  <TASK>
<4>[ 1246.772997]  ? dump_command+0x159/0x3d0 [mlx5_core]
<4>[ 1246.780267]  cmd_work_handler+0x270/0x5d0 [mlx5_core]
<4>[ 1246.787576]  process_one_work+0x1d6/0x370
<4>[ 1246.794669]  worker_thread+0x4d/0x3d0
<4>[ 1246.801796]  ? rescuer_thread+0x390/0x390
<4>[ 1246.808895]  kthread+0x124/0x150
<4>[ 1246.815957]  ? set_kthread_struct+0x40/0x40
<4>[ 1246.823051]  ret_from_fork+0x1f/0x30
<4>[ 1246.830076]  </TASK>

Is this problem known, maybe already fixed? I briefly checked git but don't see anything; could you give me a hint?

Thanks!
Jinpu Wang @ IONOS
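
P.S.: To make the use-after-free suspicion concrete, here is a minimal user-space sketch of the pattern I have in mind. It is NOT the driver code; cmd_ent, cmd_layout and comp_handler are made-up stand-ins that only mirror how a deferred handler copying from ent->lay->out would blow up if the shutdown path had already torn the entry down:

/* toy_uaf.c - simplified illustration only, not the mlx5 code paths */
#include <stdlib.h>
#include <string.h>

struct cmd_layout {
        unsigned char out[16];          /* inline output area, like lay->out */
};

struct cmd_ent {
        struct cmd_layout *lay;
        unsigned char *out;             /* caller-supplied output buffer */
};

/* stands in for the memcpy() done in the completion handler */
static void comp_handler(struct cmd_ent *ent)
{
        memcpy(ent->out, ent->lay->out, sizeof(ent->lay->out));
}

int main(void)
{
        static struct cmd_layout lay;
        unsigned char result[16];
        struct cmd_ent *ent = malloc(sizeof(*ent));

        if (!ent)
                return 1;
        ent->lay = &lay;
        ent->out = result;

        free(ent);              /* "shutdown" tears the entry down ...          */
        comp_handler(ent);      /* ... but the deferred work still uses it: UAF */

        return 0;
}

Built with gcc -fsanitize=address, this reports a heap-use-after-free right at the memcpy(); in the kernel, dereferencing such a stale pointer can end up as a GPF on a non-canonical address like the one in the log above.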