RE: [PATCH net-next v5 4/4] virtio-net: support rx netdim

Yinjun Zhang <yinjun.zhang@xxxxxxxxxxxx> · Fri, 1 Dec 2023 02:11:08 +0000

On Thursday, November 30, 2023 8:42 PM, Heng Qi wrote:
<...>
> >>>>    static void virtnet_remove(struct virtio_device *vdev)
> >>>>    {
> >>>>            struct virtnet_info *vi = vdev->priv;
> >>>> +  int i;
> >>>>
> >>>>            virtnet_cpu_notif_remove(vi);
> >>>>
> >>>>            /* Make sure no work handler is accessing the device. */
> >>>>            flush_work(&vi->config_work);
> >>>> +  for (i = 0; i < vi->max_queue_pairs; i++)
> >>>> +          cancel_work(&vi->rq[i].dim.work);
<...> 
> There's cancel_work_sync() in v4 and I did reproduce the deadlock.
> 
> rtnl_lock held -> .ndo_stop() -> cancel_work_sync() -> virtnet_rx_dim_work(),
> the work acquires the rtnl_lock again, then a deadlock occurs.
> 
> I tested the scenario of ctrl cmd/.remove/.ndo_stop()/dim_work when there is
> a big concurrency, and cancel_work() works well.

I think the question here is why you need to call `cancel_work()` in `remove()`
at all. You already call it in `close()`, and the call stack is:
remove() -> unregister_netdev() -> rtnl_lock() -> ndo_stop() -> close()

And similarly, you don't need it in the unwind path in `probe()` either.
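
To make the call chain concrete, here is a minimal userspace model of it. This
is only a sketch, not kernel code: every `model_*` name and `MAX_QUEUE_PAIRS`
are made up for illustration, and it assumes that `close()` already cancels the
per-queue dim work (as this patch does) and that unregistering a device that is
still up stops it under rtnl_lock.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_QUEUE_PAIRS 4	/* hypothetical queue count for the model */

/* Hypothetical stand-in for the driver state; not a kernel structure. */
struct model_dev {
	bool rtnl_held;				/* stands in for rtnl_lock */
	bool is_open;				/* interface is up */
	int cancel_calls[MAX_QUEUE_PAIRS];	/* cancel_work() calls per rx queue */
};

/* Models close(): cancels the dim work of every rx queue. */
static void model_close(struct model_dev *dev)
{
	int i;

	assert(dev->rtnl_held);	/* ndo_stop() is expected to run under rtnl_lock */
	for (i = 0; i < MAX_QUEUE_PAIRS; i++)
		dev->cancel_calls[i]++;
	dev->is_open = false;
}

/* Models unregister_netdev(): takes the lock and stops a device that is up. */
static void model_unregister_netdev(struct model_dev *dev)
{
	dev->rtnl_held = true;
	if (dev->is_open)
		model_close(dev);	/* ndo_stop() -> close() */
	dev->rtnl_held = false;
}

/* Models remove() without its own cancel_work() loop. */
static void model_remove(struct model_dev *dev)
{
	model_unregister_netdev(dev);
}
```

In this model, calling `model_remove()` on an open device cancels each queue's
work exactly once through `model_close()`, so an extra loop in `model_remove()`
would only repeat the same call.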

> 
<...>