Re: [PATCH vhost v2 00/10] vdpa/mlx5: Parallelize device suspend/resume

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Dragos

QE tested this series with mellanox nic, it failed with [1] when
booting guest, and host dmesg also will print messages [2]. This bug
can be reproduced boot guest with vhost-vdpa device.

[1] qemu) qemu-kvm: vhost VQ 1 ring restore failed: -1: Operation not
permitted (1)
qemu-kvm: vhost VQ 0 ring restore failed: -1: Operation not permitted (1)
qemu-kvm: unable to start vhost net: 5: falling back on userspace virtio
qemu-kvm: vhost_set_features failed: Device or resource busy (16)
qemu-kvm: unable to start vhost net: 16: falling back on userspace virtio

[2] Host dmesg:
[ 1406.187977] mlx5_core 0000:0d:00.2:
mlx5_vdpa_compat_reset:3267:(pid 8506): performing device reset
[ 1406.189221] mlx5_core 0000:0d:00.2:
mlx5_vdpa_compat_reset:3267:(pid 8506): performing device reset
[ 1406.190354] mlx5_core 0000:0d:00.2:
mlx5_vdpa_show_mr_leaks:573:(pid 8506) warning: mkey still alive after
resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2
[ 1471.538487] mlx5_core 0000:0d:00.2: cb_timeout_handler:938:(pid
428): cmd[13]: MODIFY_GENERAL_OBJECT(0xa01) Async, timeout. Will cause
a leak of a command resource
[ 1471.539486] mlx5_core 0000:0d:00.2: cb_timeout_handler:938:(pid
428): cmd[12]: MODIFY_GENERAL_OBJECT(0xa01) Async, timeout. Will cause
a leak of a command resource
[ 1471.540351] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid
8511) error: modify vq 0 failed, state: 0 -> 0, err: 0
[ 1471.541433] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid
8511) error: modify vq 1 failed, state: 0 -> 0, err: -110
[ 1471.542388] mlx5_core 0000:0d:00.2: mlx5_vdpa_set_status:3203:(pid
8511) warning: failed to resume VQs
[ 1471.549778] mlx5_core 0000:0d:00.2:
mlx5_vdpa_show_mr_leaks:573:(pid 8511) warning: mkey still alive after
resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2
[ 1512.929854] mlx5_core 0000:0d:00.2:
mlx5_vdpa_compat_reset:3267:(pid 8565): performing device reset
[ 1513.100290] mlx5_core 0000:0d:00.2:
mlx5_vdpa_show_mr_leaks:573:(pid 8565) warning: mkey still alive after
resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2

Thanks
Lei




> This series parallelizes the mlx5_vdpa device suspend and resume
> operations through the firmware async API. The purpose is to reduce live
> migration downtime.
>
> The series starts with changing the VQ suspend and resume commands
> to the async API. After that, the switch is made to issue multiple
> commands of the same type in parallel.
>
> Then, the an additional improvement is added: keep the notifiers enabled
> during suspend but make it a NOP. Upon resume make sure that the link
> state is forwarded. This shaves around 30ms per device constant time.
>
> Finally, use parallel VQ suspend and resume during the CVQ MQ command.
>
> For 1 vDPA device x 32 VQs (16 VQPs), on a large VM (256 GB RAM, 32 CPUs
> x 2 threads per core), the improvements are:
>
> +-------------------+--------+--------+-----------+
> | operation         | Before | After  | Reduction |
> |-------------------+--------+--------+-----------|
> | mlx5_vdpa_suspend | 37 ms  | 2.5 ms |     14x   |
> | mlx5_vdpa_resume  | 16 ms  | 5 ms   |      3x   |
> +-------------------+--------+--------+-----------+
>
> ---
> v2:
> - Changed to parallel VQ suspend/resume during CVQ MQ command.
>   Support added in the last 2 patches.
> - Made the fw async command more generic and moved it to resources.c.
>   Did that because the following series (parallel mkey ops) needs this
>   code as well.
>   Dropped Acked-by from Eugenio on modified patches.
> - Fixed kfree -> kvfree.
> - Removed extra newline caught during review.
> - As discussed in the v1, the series can be pulled in completely in
>   the vhost tree [0]. The mlx5_core patch was reviewed by Tariq who is
>   also a maintainer for mlx5_core.
>
> [0] - https://lore.kernel.org/virtualization/6582792d-8db2-4bc0-bf3a-248fe5c8fc56@xxxxxxxxxx/T/#maefabb2fde5adfb322d16ca16ae64d540f75b7d2
>
> Dragos Tatulea (10):
>   net/mlx5: Support throttled commands from async API
>   vdpa/mlx5: Introduce error logging function
>   vdpa/mlx5: Introduce async fw command wrapper
>   vdpa/mlx5: Use async API for vq query command
>   vdpa/mlx5: Use async API for vq modify commands
>   vdpa/mlx5: Parallelize device suspend
>   vdpa/mlx5: Parallelize device resume
>   vdpa/mlx5: Keep notifiers during suspend but ignore
>   vdpa/mlx5: Small improvement for change_num_qps()
>   vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command
>
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  21 +-
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h            |  22 +
>  drivers/vdpa/mlx5/core/resources.c            |  73 ++++
>  drivers/vdpa/mlx5/net/mlx5_vnet.c             | 396 +++++++++++-------
>  4 files changed, 361 insertions(+), 151 deletions(-)
>
> --
> 2.45.1
>





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux