On 11/22/21 12:37 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 19, 2021 at 02:32:05PM +0300, Andrey Ryabinin wrote:
>>
>>
>> On 11/16/21 8:00 AM, Jason Wang wrote:
>>> On Mon, Nov 15, 2021 at 11:32 PM Andrey Ryabinin <arbn@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Currently vhost_net_release() uses synchronize_rcu() to synchronize
>>>> freeing with vhost_zerocopy_callback(). However, synchronize_rcu()
>>>> is quite a costly operation. It takes more than 10 seconds
>>>> to shut down qemu launched with a couple of net devices like this:
>>>> -netdev tap,id=tap0,..,vhost=on,queues=80
>>>> because we end up calling synchronize_rcu() netdev_count*queues times.
>>>>
>>>> Free the vhost net structures in an RCU callback instead of using
>>>> synchronize_rcu() to fix the problem.
>>>
>>> I admit the release code is somewhat hard to understand. But I wonder
>>> if the following case can still happen with this:
>>>
>>> CPU 0 (vhost_dev_cleanup)    CPU 1 (vhost_net_zerocopy_callback()->vhost_work_queue())
>>>                              if (!dev->worker)
>>> dev->worker = NULL
>>>
>>>                              wake_up_process(dev->worker)
>>>
>>> If this is true, it seems the fix is to move the RCU synchronization
>>> into vhost_net_ubuf_put_and_wait()?
>>>
>>
>> It all depends on whether vhost_zerocopy_callback() can be called outside of the vhost
>> thread context or not. If it can run after the vhost thread has stopped, then the race you
>> describe seems possible and the fix in commit b0c057ca7e83 ("vhost: fix a theoretical race in device cleanup")
>> wasn't complete. I would fix it by calling synchronize_rcu() after vhost_net_flush()
>> and before vhost_dev_cleanup().
>>
>> As for the performance problem, it can be solved by replacing synchronize_rcu() with synchronize_rcu_expedited().
>
> expedited causes a storm of IPIs though, so it's problematic to
> do it upon a userspace syscall.
>

How about something like this?
---
 drivers/vhost/net.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 97a209d6a527..556df26c584d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -144,6 +144,10 @@ struct vhost_net {
 	struct page_frag page_frag;
 	/* Refcount bias of page frag */
 	int refcnt_bias;
+
+	struct socket *tx_sock;
+	struct socket *rx_sock;
+	struct rcu_work rwork;
 };
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -1389,6 +1393,24 @@ static void vhost_net_flush(struct vhost_net *n)
 	}
 }
 
+static void vhost_net_cleanup(struct work_struct *work)
+{
+	struct vhost_net *n =
+		container_of(to_rcu_work(work), struct vhost_net, rwork);
+	vhost_dev_cleanup(&n->dev);
+	vhost_net_vq_reset(n);
+	if (n->tx_sock)
+		sockfd_put(n->tx_sock);
+	if (n->rx_sock)
+		sockfd_put(n->rx_sock);
+	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
+	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
+	kfree(n->dev.vqs);
+	if (n->page_frag.page)
+		__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
+	kvfree(n);
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -1398,21 +1420,11 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
 	vhost_dev_stop(&n->dev);
-	vhost_dev_cleanup(&n->dev);
-	vhost_net_vq_reset(n);
-	if (tx_sock)
-		sockfd_put(tx_sock);
-	if (rx_sock)
-		sockfd_put(rx_sock);
-	/* Make sure no callbacks are outstanding */
-	synchronize_rcu();
+	n->tx_sock = tx_sock;
+	n->rx_sock = rx_sock;
 
-	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
-	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
-	kfree(n->dev.vqs);
-	if (n->page_frag.page)
-		__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
-	kvfree(n);
+	INIT_RCU_WORK(&n->rwork, vhost_net_cleanup);
+	queue_rcu_work(system_wq, &n->rwork);
 
 	return 0;
 }
--
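
For reference, the in-place alternative Andrey describes earlier in the
thread (a grace period after vhost_net_flush() and before
vhost_dev_cleanup(), possibly expedited) would look roughly like the
untested sketch below. It only illustrates the ordering under discussion;
the exact spot relative to vhost_dev_stop() is assumed, since Andrey only
says "after vhost_net_flush() and before vhost_dev_cleanup()":

/*
 * Untested sketch: wait for any in-flight vhost_zerocopy_callback()
 * before vhost_dev_cleanup() clears dev->worker, instead of deferring
 * the whole cleanup to an RCU work item.
 */
static int vhost_net_release(struct inode *inode, struct file *f)
{
	struct vhost_net *n = f->private_data;
	struct socket *tx_sock;
	struct socket *rx_sock;

	vhost_net_stop(n, &tx_sock, &rx_sock);
	vhost_net_flush(n);
	vhost_dev_stop(&n->dev);
	/* Make sure no callbacks are outstanding before the worker goes
	 * away.  synchronize_rcu_expedited() would shorten the wait, at
	 * the cost of the IPIs Michael objects to above. */
	synchronize_rcu();
	vhost_dev_cleanup(&n->dev);
	vhost_net_vq_reset(n);
	if (tx_sock)
		sockfd_put(tx_sock);
	if (rx_sock)
		sockfd_put(rx_sock);
	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
	kfree(n->dev.vqs);
	if (n->page_frag.page)
		__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
	kvfree(n);
	return 0;
}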