Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler

Jinpu Wang <jinpu.wang@xxxxxxxxx> · Mon, 21 Nov 2022 10:11:26 +0100

On Tue, Nov 15, 2022 at 5:41 PM Moshe Shemesh <moshe@xxxxxxxxxx> wrote:
>
>
> On 11/15/2022 5:08 PM, Jinpu Wang wrote:
> > On Tue, Nov 15, 2022 at 6:46 AM Jinpu Wang <jinpu.wang@xxxxxxxxx> wrote:
> >> On Tue, Nov 15, 2022 at 6:15 AM Moshe Shemesh <moshe@xxxxxxxxxx> wrote:
> >>>
> >>> On 11/9/2022 11:51 AM, Jinpu Wang wrote:
> >>>> On Mon, Oct 17, 2022 at 7:54 AM Jinpu Wang <jinpu.wang@xxxxxxxxx> wrote:
> >>>>> On Thu, Oct 13, 2022 at 12:27 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>>>> On Thu, Oct 13, 2022 at 10:32:55AM +0200, Jinpu Wang wrote:
> >>>>>>> On Thu, Oct 13, 2022 at 10:18 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>>>>>> On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
> >>>>>>>>> Hi Leon, hi Saeed,
> >>>>>>>>>
> >>>>>>>>> We have seen crashes during server shutdown on both kernel 5.10 and
> >>>>>>>>> kernel 5.15 with GPF in mlx5 mlx5_cmd_comp_handler function.
> >>>>>>>>>
> >>>>>>>>> All of the crashes point to
> >>>>>>>>>
> >>>>>>>>> 1606                         memcpy(ent->out->first.data,
> >>>>>>>>> ent->lay->out, sizeof(ent->lay->out));
> >>>>>>>>>
> >>>>>>>>> I guess it's some kind of use-after-free of the ent buffer. I tried to
> >>>>>>>>> reproduce it by repeatedly rebooting the test servers, but no success so far.
> >>>>>>>> My guess is that command interface is not flushed, but Moshe and me
> >>>>>>>> didn't see how it can happen.
> >>>>>>>>
> >>>>>>>>     1206         INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
> >>>>>>>>     1207         INIT_WORK(&ent->work, cmd_work_handler);
> >>>>>>>>     1208         if (page_queue) {
> >>>>>>>>     1209                 cmd_work_handler(&ent->work);
> >>>>>>>>     1210         } else if (!queue_work(cmd->wq, &ent->work)) {
> >>>>>>>>                             ^^^^^^^ this is what is causing the splat
> >>>>>>>>     1211                 mlx5_core_warn(dev, "failed to queue work\n");
> >>>>>>>>     1212                 err = -EALREADY;
> >>>>>>>>     1213                 goto out_free;
> >>>>>>>>     1214         }
> >>>>>>>>
> >>>>>>>> <...>
> >>>>>>>>> Is this problem known, maybe already fixed?
> >>>>>>>> I don't see any missing Fixes that exist in 6.0 and don't exist in 5.5.32.
> >>>>>> Sorry it is 5.15.32
> >>>>>>
> >>>>>>>> Is it possible to reproduce this on latest upstream code?
> >>>>>>> I haven't been able to reproduce it. As mentioned above, I tried to
> >>>>>>> reproduce it by simply rebooting in a loop, no luck yet.
> >>>>>>> Do you have suggestions to speed up the reproduction?
> >>>>>> Maybe try to shutdown during filling command interface.
> >>>>>> I think that any query command will do the trick.
> >>>>> Just an update.
> >>>>> I tried to run "saquery" in a loop in one session and "modprobe -r
> >>>>> mlx5_ib && modprobe mlx5_ib" in a loop in another session over the
> >>>>> last few days, but still no luck.
> >>>>>>> Once I can reproduce, I can also try with kernel 6.0.
> >>>>>> It will be great.
> >>>>>>
> >>>>>> Thanks
> >>>>> Thanks!
> >>>> Just want to mention that we see more crashes during reboot, and all
> >>>> the crashes we saw are on servers with the
> >>>> Intel(R) Xeon(R) Gold 6338 CPU. We use the same HCA on
> >>>> different servers, so I suspect the bug is related to Ice Lake servers.
> >>>>
> >>>> In case it matters, here is lspci attached.
> >>>
> >>> Please try the following change on 5.15.32, let me know if it solves the
> >>> failure :
> >> Thank you Moshe, I will test it on affected servers and report back the result.
> > Hi Moshe,
> >
> > I've been running the reboot tests on 4 affected machines in parallel
> > for more than 6 hours, 300+ reboots in total, and I can no longer
> > reproduce the crash. Without the fix, I was able to reproduce it 2 times
> > in 20 reboots.
> > So I think the bug is fixed.
>
>
> Great !
>
> > I also did some basic functional test via RNBD/IPOIB, all look good.
> > Tested-by: Jack Wang <jinpu.wang@xxxxxxxxx>
> > Please provide a formal fix.
>
>
> Will do.
Hi Moshe,
A gentle ping: when do you plan to send the formal fix?

Thanks!

>
> Thanks!
>
> >
> > Thx!
> >>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> >>> index e06a6104e91f..d45ca9c52a21 100644
> >>> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> >>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> >>> @@ -971,6 +971,7 @@ static void cmd_work_handler(struct work_struct *work)
> >>>                   cmd_ent_get(ent);
> >>>           set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state);
> >>>
> >>> +       cmd_ent_get(ent); /* for the _real_ FW event on completion */
> >>>           /* Skip sending command to fw if internal error */
> >>>           if (mlx5_cmd_is_down(dev) || !opcode_allowed(&dev->cmd, ent->op)) {
> >>>                   u8 status = 0;
> >>> @@ -984,7 +985,6 @@ static void cmd_work_handler(struct work_struct *work)
> >>>                   return;
> >>>           }
> >>>
> >>> -       cmd_ent_get(ent); /* for the _real_ FW event on completion */
> >>>           /* ring doorbell after the descriptor is valid */
> >>>           mlx5_core_dbg(dev, "writing 0x%x to command doorbell\n", 1 << ent->idx);
> >>>           wmb();
> >>> @@ -1598,8 +1598,8 @@ static void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool force
> >>>                                   cmd_ent_put(ent); /* timeout work was canceled */
> >>>
> >>>                           if (!forced || /* Real FW completion */
> >>> -                           pci_channel_offline(dev->pdev) || /* FW is inaccessible */
> >>> -                           dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
> >>> +                            mlx5_cmd_is_down(dev) || /* No real FW completion is expected */
> >>> +                            !opcode_allowed(cmd, ent->op))
> >>>                                   cmd_ent_put(ent);
> >>>
> >>>                           ent->ts2 = ktime_get_ns();
> >>>
> >>>> Thx!
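
For anyone reading the quoted patch later, here is a rough userspace
model of the reference counting it appears to rebalance. This is only a
sketch of how I read the diff, not driver code: struct dev_model, its
fields, is_down(), lifecycle_old() and lifecycle_new() are hypothetical
names, and I am assuming that mlx5_cmd_is_down() is true at least
whenever the PCI channel is offline or the device is in internal error
state. Each function returns the entry refcount after one command
lifecycle; a result of 1 means only the submitter's reference is left.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state; these are illustrative, not driver fields. */
struct dev_model {
	bool error;          /* PCI channel offline or internal error state */
	bool cmdif_down;     /* command interface down for another reason */
	bool opcode_allowed; /* stands in for the opcode_allowed() result */
};

/* Assumption: mlx5_cmd_is_down() is true at least whenever 'error' is. */
static bool is_down(const struct dev_model *d)
{
	return d->error || d->cmdif_down;
}

/* Before the patch: the FW-event reference is taken after the skip check. */
static int lifecycle_old(const struct dev_model *d)
{
	int refcnt = 1; /* reference held by the submitter */

	if (is_down(d) || !d->opcode_allowed) {
		/* forced completion: old check puts only on device error */
		if (d->error)
			refcnt--;
		return refcnt;
	}
	refcnt++; /* for the real FW event on completion */
	refcnt--; /* real FW completion (!forced) drops it again */
	return refcnt;
}

/* After the patch: the reference is taken before the skip check. */
static int lifecycle_new(const struct dev_model *d)
{
	int refcnt = 1; /* reference held by the submitter */

	refcnt++; /* for the real FW event on completion */
	if (is_down(d) || !d->opcode_allowed) {
		/* forced completion: put under the same condition as the skip */
		refcnt--;
		return refcnt;
	}
	refcnt--; /* real FW completion (!forced) drops it again */
	return refcnt;
}
```

In this model lifecycle_old() drops to 0 when the command is skipped
while the device is in error, which would match a use-after-free once
the submitter touches the entry again, while lifecycle_new() stays at 1
in every case. Whether that is the exact path hit during shutdown on
the affected servers is not something the model can confirm.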