On Thu, Oct 13, 2022 at 10:18 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
> > Hi Leon, hi Saeed,
> >
> > We have seen crashes during server shutdown on both kernel 5.10 and
> > kernel 5.15 with GPF in mlx5 mlx5_cmd_comp_handler function.
> >
> > All of the crashes point to
> >
> > 1606                         memcpy(ent->out->first.data,
> > ent->lay->out, sizeof(ent->lay->out));
> >
> > I guess, it's kind of use after free for ent buffer. I tried to reprod
> > by repeatedly reboot the testing servers, but no success so far.
>
> My guess is that command interface is not flushed, but Moshe and me
> didn't see how it can happen.
>
>   1206         INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
>   1207         INIT_WORK(&ent->work, cmd_work_handler);
>   1208         if (page_queue) {
>   1209                 cmd_work_handler(&ent->work);
>   1210         } else if (!queue_work(cmd->wq, &ent->work)) {
>                           ^^^^^^^ this is what is causing to the splat
>   1211                 mlx5_core_warn(dev, "failed to queue work\n");
>   1212                 err = -EALREADY;
>   1213                 goto out_free;
>   1214         }
>
> <...>
> >
> > Is this problem known, maybe already fixed?
>
> I don't see any missing Fixes that exist in 6.0 and don't exist in 5.5.32.
> Is it possible to reproduce this on latest upstream code?

I haven't been able to reproduce it yet. As mentioned above, I tried
rebooting the servers in a loop, but no luck so far.
Do you have any suggestions to speed up the reproduction?
Once I can reproduce it, I can also try with kernel 6.0.

> And what is your FW version?

Here is the ibstat output:

CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.27.2008
        Hardware version: 0
        Node GUID: 0x08c0eb030054b372
        System image GUID: 0x08c0eb030054b372
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 15
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x08c0eb030054b372
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.27.2008
        Hardware version: 0
        Node GUID: 0x08c0eb030054b373
        System image GUID: 0x08c0eb030054b372
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 12
                LMC: 0
                SM lid: 4
                Capability mask: 0x2651e848
                Port GUID: 0x08c0eb030054b373
                Link layer: InfiniBand

Thanks for your help!

> >
> > I briefly checked the git, don't see anything, could you give me some hint?
> >
> >
> > Thanks!
> > Jinpu Wang @ IONOS