Hi Dragos,

QE tested this series with a Mellanox NIC. It failed with [1] when booting
the guest, and the host dmesg also printed the messages in [2]. The bug can
be reproduced by booting a guest with a vhost-vdpa device.

[1]
(qemu) qemu-kvm: vhost VQ 1 ring restore failed: -1: Operation not permitted (1)
qemu-kvm: vhost VQ 0 ring restore failed: -1: Operation not permitted (1)
qemu-kvm: unable to start vhost net: 5: falling back on userspace virtio
qemu-kvm: vhost_set_features failed: Device or resource busy (16)
qemu-kvm: unable to start vhost net: 16: falling back on userspace virtio

[2] Host dmesg:
[ 1406.187977] mlx5_core 0000:0d:00.2: mlx5_vdpa_compat_reset:3267:(pid 8506): performing device reset
[ 1406.189221] mlx5_core 0000:0d:00.2: mlx5_vdpa_compat_reset:3267:(pid 8506): performing device reset
[ 1406.190354] mlx5_core 0000:0d:00.2: mlx5_vdpa_show_mr_leaks:573:(pid 8506) warning: mkey still alive after resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2
[ 1471.538487] mlx5_core 0000:0d:00.2: cb_timeout_handler:938:(pid 428): cmd[13]: MODIFY_GENERAL_OBJECT(0xa01) Async, timeout. Will cause a leak of a command resource
[ 1471.539486] mlx5_core 0000:0d:00.2: cb_timeout_handler:938:(pid 428): cmd[12]: MODIFY_GENERAL_OBJECT(0xa01) Async, timeout. Will cause a leak of a command resource
[ 1471.540351] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid 8511) error: modify vq 0 failed, state: 0 -> 0, err: 0
[ 1471.541433] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid 8511) error: modify vq 1 failed, state: 0 -> 0, err: -110
[ 1471.542388] mlx5_core 0000:0d:00.2: mlx5_vdpa_set_status:3203:(pid 8511) warning: failed to resume VQs
[ 1471.549778] mlx5_core 0000:0d:00.2: mlx5_vdpa_show_mr_leaks:573:(pid 8511) warning: mkey still alive after resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2
[ 1512.929854] mlx5_core 0000:0d:00.2: mlx5_vdpa_compat_reset:3267:(pid 8565): performing device reset
[ 1513.100290] mlx5_core 0000:0d:00.2: mlx5_vdpa_show_mr_leaks:573:(pid 8565) warning: mkey still alive after resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2

Thanks
Lei

> This series parallelizes the mlx5_vdpa device suspend and resume
> operations through the firmware async API. The purpose is to reduce live
> migration downtime.
>
> The series starts with changing the VQ suspend and resume commands
> to the async API. After that, the switch is made to issue multiple
> commands of the same type in parallel.
>
> Then, an additional improvement is added: keep the notifiers enabled
> during suspend but make it a NOP. Upon resume make sure that the link
> state is forwarded. This shaves around 30ms per device constant time.
>
> Finally, use parallel VQ suspend and resume during the CVQ MQ command.
>
> For 1 vDPA device x 32 VQs (16 VQPs), on a large VM (256 GB RAM, 32 CPUs
> x 2 threads per core), the improvements are:
>
> +-------------------+--------+--------+-----------+
> | operation         | Before | After  | Reduction |
> |-------------------+--------+--------+-----------|
> | mlx5_vdpa_suspend | 37 ms  | 2.5 ms |     14x   |
> | mlx5_vdpa_resume  | 16 ms  | 5 ms   |      3x   |
> +-------------------+--------+--------+-----------+
>
> ---
> v2:
> - Changed to parallel VQ suspend/resume during CVQ MQ command.
>   Support added in the last 2 patches.
> - Made the fw async command more generic and moved it to resources.c.
>   Did that because the following series (parallel mkey ops) needs this
>   code as well.
>   Dropped Acked-by from Eugenio on modified patches.
> - Fixed kfree -> kvfree.
> - Removed extra newline caught during review.
> - As discussed in the v1, the series can be pulled in completely in
>   the vhost tree [0]. The mlx5_core patch was reviewed by Tariq who is
>   also a maintainer for mlx5_core.
>
> [0] - https://lore.kernel.org/virtualization/6582792d-8db2-4bc0-bf3a-248fe5c8fc56@xxxxxxxxxx/T/#maefabb2fde5adfb322d16ca16ae64d540f75b7d2
>
> Dragos Tatulea (10):
>   net/mlx5: Support throttled commands from async API
>   vdpa/mlx5: Introduce error logging function
>   vdpa/mlx5: Introduce async fw command wrapper
>   vdpa/mlx5: Use async API for vq query command
>   vdpa/mlx5: Use async API for vq modify commands
>   vdpa/mlx5: Parallelize device suspend
>   vdpa/mlx5: Parallelize device resume
>   vdpa/mlx5: Keep notifiers during suspend but ignore
>   vdpa/mlx5: Small improvement for change_num_qps()
>   vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command
>
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  21 +-
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h            |  22 +
>  drivers/vdpa/mlx5/core/resources.c            |  73 ++++
>  drivers/vdpa/mlx5/net/mlx5_vnet.c             | 396 +++++++++++-------
>  4 files changed, 361 insertions(+), 151 deletions(-)
>
> --
> 2.45.1
>
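
In case it helps with triage, below is a minimal sketch that tallies mlx5_core messages from a host dmesg capture like the one in [2], grouped by reporting function and severity. It is not part of the series or of any mlx5 tooling. The regular expression and the names `LINE_RE`, `summarize` and `SAMPLE` are assumptions made for illustration, derived only from the line format seen in the log above; lines without an explicit "warning" or "error" tag are counted as "info".

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines quoted in [2]:
# [ <timestamp>] mlx5_core <pci-dev>: <function>:<line>:(pid <n>)[ warning|error]: <message>
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<dev>\S+):\s+"
    r"(?P<func>\w+):(?P<line>\d+):\(pid (?P<pid>\d+)\)"
    r"(?:\s+(?P<level>warning|error))?:\s+(?P<msg>.*)"
)


def summarize(dmesg_text):
    """Count mlx5_core messages per (function, level).

    Lines that carry no explicit level are counted under 'info'.
    Lines that do not match the assumed layout are skipped.
    """
    counts = Counter()
    for raw in dmesg_text.splitlines():
        match = LINE_RE.search(raw)
        if match:
            counts[(match["func"], match["level"] or "info")] += 1
    return counts


# A few lines copied from the host dmesg in [2].
SAMPLE = """\
[ 1406.187977] mlx5_core 0000:0d:00.2: mlx5_vdpa_compat_reset:3267:(pid 8506): performing device reset
[ 1471.538487] mlx5_core 0000:0d:00.2: cb_timeout_handler:938:(pid 428): cmd[13]: MODIFY_GENERAL_OBJECT(0xa01) Async, timeout. Will cause a leak of a command resource
[ 1471.540351] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid 8511) error: modify vq 0 failed, state: 0 -> 0, err: 0
[ 1471.541433] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid 8511) error: modify vq 1 failed, state: 0 -> 0, err: -110
[ 1471.542388] mlx5_core 0000:0d:00.2: mlx5_vdpa_set_status:3203:(pid 8511) warning: failed to resume VQs
"""

if __name__ == "__main__":
    for (func, level), n in sorted(summarize(SAMPLE).items()):
        print(f"{func:28s} {level:8s} {n}")
```

Feeding it the full dmesg output instead of `SAMPLE` should make it easier to see that the async command timeouts in cb_timeout_handler precede the modify_virtqueues failures and the "failed to resume VQs" warning.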