Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler

Moshe Shemesh <moshe@xxxxxxxxxx> · Tue, 22 Nov 2022 06:31:06 +0200

On 11/21/2022 11:11 AM, Jinpu Wang wrote:

On Tue, Nov 15, 2022 at 5:41 PM Moshe Shemesh <moshe@xxxxxxxxxx> wrote:

On 11/15/2022 5:08 PM, Jinpu Wang wrote:
On Tue, Nov 15, 2022 at 6:46 AM Jinpu Wang <jinpu.wang@xxxxxxxxx> wrote:
On Tue, Nov 15, 2022 at 6:15 AM Moshe Shemesh <moshe@xxxxxxxxxx> wrote:
On 11/9/2022 11:51 AM, Jinpu Wang wrote:
On Mon, Oct 17, 2022 at 7:54 AM Jinpu Wang <jinpu.wang@xxxxxxxxx> wrote:
On Thu, Oct 13, 2022 at 12:27 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
On Thu, Oct 13, 2022 at 10:32:55AM +0200, Jinpu Wang wrote:
On Thu, Oct 13, 2022 at 10:18 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
Hi Leon, hi Saeed,

We have seen crashes during server shutdown on both kernel 5.10 and
kernel 5.15, with a GPF in the mlx5 function mlx5_cmd_comp_handler.

All of the crashes point to

1606         memcpy(ent->out->first.data, ent->lay->out, sizeof(ent->lay->out));

I guess it's a kind of use-after-free of the ent buffer. I tried to
reproduce it by repeatedly rebooting the test servers, but no success so far.
My guess is that the command interface is not flushed, but Moshe and I
didn't see how that can happen.

     1206         INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
     1207         INIT_WORK(&ent->work, cmd_work_handler);
     1208         if (page_queue) {
     1209                 cmd_work_handler(&ent->work);
     1210         } else if (!queue_work(cmd->wq, &ent->work)) {
                             ^^^^^^^ this is what is causing the splat
     1211                 mlx5_core_warn(dev, "failed to queue work\n");
     1212                 err = -EALREADY;
     1213                 goto out_free;
     1214         }

<...>
Is this problem known, maybe already fixed?
I don't see any missing Fixes that exist in 6.0 and don't exist in 5.5.32.
Sorry it is 5.15.32

Is it possible to reproduce this on latest upstream code?
I haven't been able to reproduce it. As mentioned above, I tried to
reproduce it by simply rebooting in a loop; no luck yet.
Do you have suggestions to speed up the reproduction?
Maybe try to shut down while the command interface is being filled.
I think that any query command will do the trick.
Just an update.
I tried to run "saquery" in a loop in one session and "modprobe -r
mlx5_ib && modprobe mlx5_ib" in a loop in another session over the last
few days, but still no luck.
Once I can reproduce, I can also try with kernel 6.0.
It will be great.

Thanks
Thanks!
Just want to mention, we see more crashes during reboot; all the crashes
we saw are on servers with
Intel(R) Xeon(R) Gold 6338 CPUs. We use the same HCA on
different servers, so I suspect the bug is related to Ice Lake servers.

In case it matters, the lspci output is attached.
Please try the following change on 5.15.32 and let me know if it solves the
failure:
Thank you Moshe, I will test it on affected servers and report back the result.
Hi Moshe,

I've been running the reboot tests on 4 affected machines in parallel
for more than 6 hours, 300+ reboots in total, and I can no longer
reproduce the crash. Without the fix, I was able to reproduce it 2 times
in 20 reboots.
So I think the bug is fixed.
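
As a rough sanity check on these numbers (an illustration, not part of the original test report): if one assumes that each reboot crashes independently, at roughly the 10% rate implied by 2 crashes in 20 reboots, then 300 clean reboots on an unfixed kernel would be extremely unlikely. The helper name `prob_no_crash` and the independence assumption are hypothetical and made only for this sketch.

```c
/* Probability of seeing zero crashes in `reboots` attempts, ASSUMING each
 * reboot crashes independently with probability `crash_rate`.
 * Hypothetical helper for illustration only; not taken from the thread. */
static double prob_no_crash(double crash_rate, int reboots)
{
    double p = 1.0;

    for (int i = 0; i < reboots; i++)
        p *= 1.0 - crash_rate;  /* one more reboot completes without a crash */

    return p;
}
```

With `crash_rate = 2.0 / 20` and `reboots = 300`, the result is on the order of 1e-14, so under these assumptions the clean run is hard to explain by luck alone.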

Great!

I also did some basic functional tests via RNBD/IPoIB; all look good.
Tested-by: Jack Wang <jinpu.wang@xxxxxxxxx>
Please provide a formal fix.

Will do.
Hi Moshe,
A gentle ping, when will you send the fix?

Thanks!

Hi, it is part of Saeed's mlx5 fixes patchset.

He sent it a couple of hours ago.


Thanks!

Thx!

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index e06a6104e91f..d45ca9c52a21 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -971,6 +971,7 @@ static void cmd_work_handler(struct work_struct *work)
                 cmd_ent_get(ent);
         set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state);
 
+        cmd_ent_get(ent); /* for the _real_ FW event on completion */
         /* Skip sending command to fw if internal error */
         if (mlx5_cmd_is_down(dev) || !opcode_allowed(&dev->cmd, ent->op)) {
                 u8 status = 0;
@@ -984,7 +985,6 @@ static void cmd_work_handler(struct work_struct *work)
                 return;
         }
 
-        cmd_ent_get(ent); /* for the _real_ FW event on completion */
         /* ring doorbell after the descriptor is valid */
         mlx5_core_dbg(dev, "writing 0x%x to command doorbell\n", 1 << ent->idx);
         wmb();
@@ -1598,8 +1598,8 @@ static void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool force
                                 cmd_ent_put(ent); /* timeout work was canceled */
 
                         if (!forced || /* Real FW completion */
-                            pci_channel_offline(dev->pdev) || /* FW is inaccessible */
-                            dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
+                             mlx5_cmd_is_down(dev) || /* No real FW completion is expected */
+                             !opcode_allowed(cmd, ent->op))
                                 cmd_ent_put(ent);
 
                         ent->ts2 = ktime_get_ns();
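
One way to read the diff, sketched as a user-space model rather than driver code: the reference for the FW completion event is now taken before the "skip sending" check, so the put in the forced completion path has a matching get. Everything in the sketch (`ent_model`, `comp_model`, `submit_skipped_model`, and the single `cmd_down` flag standing in for the real conditions) is hypothetical and simplified. It assumes that the handler's put fires on the skipped path, which is the case the patch appears to balance.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the command entry: only a reference count. */
struct ent_model {
    int refcnt;
};

/* Completion side of the model: drops the "FW event" reference on a real
 * completion, or on a forced one when no FW completion is expected. */
static void comp_model(struct ent_model *ent, bool forced, bool cmd_down)
{
    if (!forced || cmd_down)
        ent->refcnt--;
}

/* Submission side of the model for a command that is skipped because the
 * command interface is down. `get_before_check` selects the ordering:
 * true  = reference taken before the skip check (as in the patch),
 * false = reference would only be taken after the check, so never here.
 * Returns the remaining count; the caller's own reference starts it at 1. */
static int submit_skipped_model(bool get_before_check)
{
    struct ent_model ent = { .refcnt = 1 };

    if (get_before_check)
        ent.refcnt++;               /* reference for the FW event */

    comp_model(&ent, true, true);   /* forced completion, interface down */

    return ent.refcnt;
}
```

In this model `submit_skipped_model(false)` returns 0, meaning the completion path consumed the caller's own reference, while `submit_skipped_model(true)` returns 1, leaving the caller's reference intact.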

Thx!