On Thursday, November 30, 2023 8:42 PM, Heng Qi wrote:
<...>
> >>>> static void virtnet_remove(struct virtio_device *vdev)
> >>>> {
> >>>>         struct virtnet_info *vi = vdev->priv;
> >>>> +       int i;
> >>>>
> >>>>         virtnet_cpu_notif_remove(vi);
> >>>>
> >>>>         /* Make sure no work handler is accessing the device. */
> >>>>         flush_work(&vi->config_work);
> >>>> +       for (i = 0; i < vi->max_queue_pairs; i++)
> >>>> +               cancel_work(&vi->rq[i].dim.work);
<...>
> There's cancel_work_sync() in v4 and I did reproduce the deadlock.
>
> rtnl_lock held -> .ndo_stop() -> cancel_work_sync() -> virtnet_rx_dim_work(),
> the work acquires the rtnl_lock again, then a deadlock occurs.
>
> I tested the scenario of ctrl cmd/.remove/.ndo_stop()/dim_work when there is
> a big concurrency, and cancel_work() works well.

I think the question here is why you need to call `cancel_work()` in
`remove()` at all. You already call it in `close()`, and the call stack is:

    remove() -> unregister_netdev() -> rtnl_lock() -> ndo_stop() -> close()

Similarly, you don't need it in the unwind path in `probe()` either.

> <...>
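
For illustration only, the deadlock described in the quoted text can be modelled in a few lines of userspace C. This is a sketch, not driver code: `struct model`, `work_can_finish()`, `model_cancel_work()`, `model_cancel_work_sync()` and `model_ndo_stop()` are hypothetical names made up for this example. It assumes that the dim work handler needs the rtnl lock to finish and that the lock is not recursive. A `false` return stands for "the caller would block forever".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code. */
enum work_state { WORK_IDLE, WORK_PENDING, WORK_RUNNING };

struct model {
	bool rtnl_held;           /* stands in for the rtnl lock */
	enum work_state dim_work; /* stands in for rq[i].dim.work */
};

/* Assumption: the work handler takes the rtnl lock, so it can only
 * finish while nobody else is holding that lock. */
static bool work_can_finish(const struct model *m)
{
	return !m->rtnl_held;
}

/* Model of cancel_work(): dequeue pending work, never wait. */
static bool model_cancel_work(struct model *m)
{
	if (m->dim_work == WORK_PENDING)
		m->dim_work = WORK_IDLE;
	return true;
}

/* Model of cancel_work_sync(): also wait for a running handler.
 * Returns false when that wait could never complete. */
static bool model_cancel_work_sync(struct model *m)
{
	if (m->dim_work == WORK_RUNNING && !work_can_finish(m))
		return false; /* handler waits for rtnl, we wait for handler */
	m->dim_work = WORK_IDLE;
	return true;
}

/* Model of the .ndo_stop() path: the caller holds rtnl throughout. */
static bool model_ndo_stop(struct model *m, bool sync)
{
	bool ok;

	assert(!m->rtnl_held); /* the lock is not recursive */
	m->rtnl_held = true;
	ok = sync ? model_cancel_work_sync(m) : model_cancel_work(m);
	if (ok)
		m->rtnl_held = false; /* released only if we got this far */
	return ok;
}
```

With the work already running, `model_ndo_stop(&m, true)` returns `false` (the v4 deadlock), while `model_ndo_stop(&m, false)` returns `true` and drops the lock, after which the handler is free to take it.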