Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler

Jinpu Wang <jinpu.wang@xxxxxxxxx> · Tue, 22 Nov 2022 07:08:10 +0100



On Tue, Nov 22, 2022 at 5:31 AM Moshe Shemesh <moshe@xxxxxxxxxx> wrote:
>
>
> On 11/21/2022 11:11 AM, Jinpu Wang wrote:
> >
> > On Tue, Nov 15, 2022 at 5:41 PM Moshe Shemesh <moshe@xxxxxxxxxx> wrote:
> >>
> >> On 11/15/2022 5:08 PM, Jinpu Wang wrote:
> >>> On Tue, Nov 15, 2022 at 6:46 AM Jinpu Wang <jinpu.wang@xxxxxxxxx> wrote:
> >>>> On Tue, Nov 15, 2022 at 6:15 AM Moshe Shemesh <moshe@xxxxxxxxxx> wrote:
> >>>>> On 11/9/2022 11:51 AM, Jinpu Wang wrote:
> >>>>>> On Mon, Oct 17, 2022 at 7:54 AM Jinpu Wang <jinpu.wang@xxxxxxxxx> wrote:
> >>>>>>> On Thu, Oct 13, 2022 at 12:27 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>>>>>> On Thu, Oct 13, 2022 at 10:32:55AM +0200, Jinpu Wang wrote:
> >>>>>>>>> On Thu, Oct 13, 2022 at 10:18 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>>>>>>>> On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
> >>>>>>>>>>> Hi Leon, hi Saeed,
> >>>>>>>>>>>
> >>>>>>>>>>> We have seen crashes during server shutdown on both kernel 5.10 and
> >>>>>>>>>>> kernel 5.15 with GPF in mlx5 mlx5_cmd_comp_handler function.
> >>>>>>>>>>>
> >>>>>>>>>>> All of the crashes point to
> >>>>>>>>>>>
> >>>>>>>>>>> 1606                         memcpy(ent->out->first.data,
> >>>>>>>>>>> ent->lay->out, sizeof(ent->lay->out));
> >>>>>>>>>>>
> >>>>>>>>>>> I guess it's a kind of use-after-free of the ent buffer. I tried to reproduce
> >>>>>>>>>>> it by repeatedly rebooting the test servers, but no success so far.
> >>>>>>>>>> My guess is that the command interface is not flushed, but Moshe and I
> >>>>>>>>>> didn't see how it can happen.
> >>>>>>>>>>
> >>>>>>>>>>      1206         INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
> >>>>>>>>>>      1207         INIT_WORK(&ent->work, cmd_work_handler);
> >>>>>>>>>>      1208         if (page_queue) {
> >>>>>>>>>>      1209                 cmd_work_handler(&ent->work);
> >>>>>>>>>>      1210         } else if (!queue_work(cmd->wq, &ent->work)) {
> >>>>>>>>>>                              ^^^^^^^ this is what is causing the splat
> >>>>>>>>>>      1211                 mlx5_core_warn(dev, "failed to queue work\n");
> >>>>>>>>>>      1212                 err = -EALREADY;
> >>>>>>>>>>      1213                 goto out_free;
> >>>>>>>>>>      1214         }
> >>>>>>>>>>
> >>>>>>>>>> <...>
> >>>>>>>>>>> Is this problem known, maybe already fixed?
> >>>>>>>>>> I don't see any missing Fixes that exist in 6.0 and don't exist in 5.5.32.
> >>>>>>>> Sorry it is 5.15.32
> >>>>>>>>
> >>>>>>>>>> Is it possible to reproduce this on latest upstream code?
> >>>>>>>>> I haven't been able to reproduce it. As mentioned above, I tried to
> >>>>>>>>> reproduce it by simply rebooting in a loop, but no luck yet.
> >>>>>>>>> Do you have suggestions to speed up the reproduction?
> >>>>>>>> Maybe try to shut down while the command interface is being filled.
> >>>>>>>> I think that any query command will do the trick.
> >>>>>>> Just an update.
> >>>>>>> I tried to run "saquery" in a loop in one session and "modprobe -r
> >>>>>>> mlx5_ib && modprobe mlx5_ib" in a loop in another session over the past
> >>>>>>> days, but still no luck.
> >>>>>>>>> Once I can reproduce, I can also try with kernel 6.0.
> >>>>>>>> It will be great.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>> Thanks!
> >>>>>> Just want to mention that we see more crashes during reboot. All the
> >>>>>> crashes we saw are on servers with Intel(R) Xeon(R) Gold 6338 CPUs.
> >>>>>> We use the same HCA on different servers, so I suspect the bug is
> >>>>>> related to Ice Lake servers.
> >>>>>>
> >>>>>> In case it matters, the lspci output is attached.
> >>>>> Please try the following change on 5.15.32 and let me know if it solves the
> >>>>> failure:
> >>>> Thank you Moshe, I will test it on affected servers and report back the result.
> >>> Hi Moshe,
> >>>
> >>> I've been running the reboot tests on 4 affected machines in parallel
> >>> for more than 6 hours, 300+ reboots in total, and I can no longer
> >>> reproduce the crash. Without the fix, I was able to reproduce it 2 times
> >>> in 20 reboots.
> >>> So I think the bug is fixed.
> >>
> >> Great !
> >>
> >>> I also did some basic functional tests via RNBD/IPoIB; all look good.
> >>> Tested-by: Jack Wang <jinpu.wang@xxxxxxxxx>
> >>> Please provide a formal fix.
> >>
> >> Will do.
> > Hi Moshe,
> > A gentle ping, when will you send the fix?
> >
> > Thanks!
>
> Hi, it is part of Saeed's mlx5 fixes patchset.
>
> He sent it a couple of hours ago.
Yes, indeed.
ref: https://lore.kernel.org/netdev/20221122022559.89459-6-saeed@xxxxxxxxxx/T/#u

Thx!
>
> >
> >> Thanks!
> >>
> >>> Thx!
> >>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> >>>>> index e06a6104e91f..d45ca9c52a21 100644
> >>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> >>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> >>>>> @@ -971,6 +971,7 @@ static void cmd_work_handler(struct work_struct *work)
> >>>>>                    cmd_ent_get(ent);
> >>>>>            set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state);
> >>>>>
> >>>>> +       cmd_ent_get(ent); /* for the _real_ FW event on completion */
> >>>>>            /* Skip sending command to fw if internal error */
> >>>>>            if (mlx5_cmd_is_down(dev) || !opcode_allowed(&dev->cmd, ent->op)) {
> >>>>>                    u8 status = 0;
> >>>>> @@ -984,7 +985,6 @@ static void cmd_work_handler(struct work_struct *work)
> >>>>>                    return;
> >>>>>            }
> >>>>>
> >>>>> -       cmd_ent_get(ent); /* for the _real_ FW event on completion */
> >>>>>            /* ring doorbell after the descriptor is valid */
> >>>>>            mlx5_core_dbg(dev, "writing 0x%x to command doorbell\n", 1 << ent->idx);
> >>>>>            wmb();
> >>>>> @@ -1598,8 +1598,8 @@ static void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool force
> >>>>>                                    cmd_ent_put(ent); /* timeout work was canceled */
> >>>>>
> >>>>>                            if (!forced || /* Real FW completion */
> >>>>> -                           pci_channel_offline(dev->pdev) || /* FW is inaccessible */
> >>>>> -                           dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
> >>>>> +                            mlx5_cmd_is_down(dev) || /* No real FW completion is expected */
> >>>>> +                            !opcode_allowed(cmd, ent->op))
> >>>>>                                    cmd_ent_put(ent);
> >>>>>
> >>>>>                            ent->ts2 = ktime_get_ns();
> >>>>>
> >>>>>> Thx!
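
For readers following the thread, the reference-count imbalance that the
quoted patch appears to address can be sketched with a small userspace model.
This is only an illustration, not driver code: toy_ent, toy_get, toy_put,
toy_comp_handler and toy_submit are hypothetical names, and the single
cmd_down flag is an assumed simplification of the mlx5_cmd_is_down() /
opcode_allowed() checks in the diff. The model starts with one reference held
by the caller and reports how many references remain after the submit path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a command entry; only the refcount is modelled. */
struct toy_ent {
	int refcnt;
};

static void toy_get(struct toy_ent *ent)
{
	ent->refcnt++;
}

static void toy_put(struct toy_ent *ent)
{
	assert(ent->refcnt > 0); /* dropping a reference nobody holds is a bug */
	ent->refcnt--;
}

/*
 * Completion path: drop the "FW completion" reference on a real completion,
 * or on a forced one when no real completion is expected (cmd_down).
 */
static void toy_comp_handler(struct toy_ent *ent, bool forced, bool cmd_down)
{
	if (!forced || cmd_down)
		toy_put(ent);
}

/*
 * Submit path. get_before_check == false models taking the FW reference
 * only after the "skip sending" check; true models taking it before.
 * Returns the references left; the caller's own reference (1) should survive.
 */
static int toy_submit(bool cmd_down, bool get_before_check)
{
	struct toy_ent ent = { .refcnt = 1 }; /* caller's reference */

	if (get_before_check)
		toy_get(&ent);

	if (cmd_down) {
		/* command is not sent; completion is forced by the driver */
		toy_comp_handler(&ent, true, cmd_down);
		return ent.refcnt;
	}

	if (!get_before_check)
		toy_get(&ent);

	/* doorbell rung; the real FW completion arrives later */
	toy_comp_handler(&ent, false, cmd_down);
	return ent.refcnt;
}
```

In this model, toy_submit(true, false) ends with no references left even
though the caller still uses the entry, which is consistent with the
use-after-free suspected at the memcpy. With the get moved before the check,
toy_submit(true, true) keeps the caller's reference intact.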