Re: [Questions] non-shared disk migration: jobs abort and bandwidth

Han Han <hhan@xxxxxxxxxx> · Thu, 9 Jun 2022 14:52:13 +0800

On Wed, Jun 8, 2022 at 6:49 PM Peter Krempa <pkrempa@xxxxxxxxxx> wrote:
On Wed, Jun 08, 2022 at 17:32:57 +0800, Han Han wrote:

> Hi developers,

> Recently, I have been researching migration with non-shared disks (flags

> VIR_MIGRATE_NON_SHARED_DISK and VIR_MIGRATE_NON_SHARED_INC).

> As we know, non-shared disk migration uses block jobs to copy the

> disk image from the src host to the dst host. So here are my questions for

> non-shared disk migration:

> q1. For the API virDomainMigrate3 with the bandwidth param, could it set

> the bandwidth of block jobs?

> q2. For the API virDomainMigrateSetMaxSpeed, could it set the bandwidth of

> block jobs?

> q3. For the domain job abort API virDomainAbortJob, could it stop the block

> job of non-shared disk migration?

> q4. For the block job bandwidth API virDomainBlockJobSetSpeed, could it set

> the bandwidth of the block job of non-shared disk migration?

> q5. For the block job abort API virDomainBlockJobAbort, could it stop the

> block job of non-shared disk migration?

> 

> Then I got the following test results with libvirt-8.4.0-1.el9.x86_64 and

> qemu-kvm-7.0.0-4.el9.x86_64:

> q1: The bandwidth limit of virDomainMigrate3 applies to the block job:

> ➜  ~ virsh migrate OVMF qemu+ssh://root@hhan-rhel9--1/system --live --p2p

> --tls --tls-destination hhan-rhel9--1 --copy-storage-all --disks-uri

> tcp://hhan-rhel9--1:49156 --bandwidth 2

> ➜  ~ virsh blockjob OVMF vda

> Block Copy: [  0 %]    Bandwidth limit: 2097152 bytes/s (2.000 MiB/s)

This is expected and desired.

> q2: virDomainMigrateSetMaxSpeed doesn't change the bandwidth of

> block jobs.

> ➜  ~ virsh migrate-setspeed OVMF 8

> 

> ➜  ~ virsh blockjob OVMF vda

> Block Copy: [  9 %]    Bandwidth limit: 2097152 bytes/s (2.000 MiB/s)

This is a bug though: since we want the global migration speed setting to

apply to disks too, setting the migration speed should also apply to the

disk migration streams.
I filed a bug here: https://bugzilla.redhat.com/show_bug.cgi?id=2095093
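
For anyone who wants to re-check this once the bug is fixed, here is a minimal sketch that compares the speed passed to "virsh migrate-setspeed" with the limit printed by "virsh blockjob". It only parses the output lines quoted above. The function names are hypothetical, and it assumes the migrate-setspeed value is in MiB/s (1 MiB = 1048576 bytes), which matches the 2 -> 2097152 bytes/s conversion seen in q1.

```python
import re

MIB = 1024 * 1024  # assumption: virsh bandwidth values are in MiB/s


def blockjob_limit_bytes(line):
    """Extract the 'Bandwidth limit' value in bytes/s, or None if absent."""
    match = re.search(r"Bandwidth limit:\s*(\d+)\s*bytes/s", line)
    return int(match.group(1)) if match else None


def setspeed_reached_blockjob(requested_mib_per_s, blockjob_line):
    """True if the block job limit equals the requested migration speed."""
    limit = blockjob_limit_bytes(blockjob_line)
    return limit is not None and limit == requested_mib_per_s * MIB


# Output line quoted in q2, captured after "virsh migrate-setspeed OVMF 8".
after_setspeed = "Block Copy: [  9 %]    Bandwidth limit: 2097152 bytes/s (2.000 MiB/s)"
print(setspeed_reached_blockjob(8, after_setspeed))  # False: still 2 MiB/s
```

With the q2 output the check returns False, which is the behaviour reported in the bug. It would return True once the block job limit follows the new speed.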

> q3: The virDomainAbortJob could stop a block job of non-shared disk

> migration

> ➜  ~ virsh migrate OVMF qemu+ssh://root@hhan-rhel9--1/system --live --p2p

> --tls --tls-destination hhan-rhel9--1 --copy-storage-all --disks-uri

> tcp://hhan-rhel9--1:49156 --bandwidth 2

> Then start a virsh event on another terminal:

> ➜  ~ virsh event --loop --all

> 

> Abort the domain job:

> ➜  ~ virsh domjobabort OVMF

> 

> The error "error: operation aborted: migration out: canceled by client"

> appears at the terminal of "virsh migrate"

> The terminal of "virsh event" shows the block job has failed:

> event 'block-job' for domain 'OVMF': Block Copy for

> /var/lib/libvirt/images/OVMF.qcow2 failed

> event 'block-job-2' for domain 'OVMF': Block Copy for vda failed

This is again expected: the blockjobs are started by the migration, so

when you cancel the migration we also need to cancel the blockjobs.
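
To check this automatically, the "virsh event" output can be scanned for failed block jobs after the abort. The following is a rough sketch based only on the two event lines quoted above. The regular expression and the helper names are my own assumptions, and it assumes each event is printed on a single line (the wrap in the quote comes from the mail client).

```python
import re

# Matches lines such as:
#   event 'block-job-2' for domain 'OVMF': Block Copy for vda failed
EVENT_RE = re.compile(
    r"^event '(?P<event>block-job(?:-2)?)' for domain '(?P<domain>[^']+)': "
    r"(?P<job>.+?) for (?P<target>\S+) (?P<status>\w+)$"
)


def parse_block_job_event(line):
    """Return the event fields as a dict, or None for unrelated lines."""
    match = EVENT_RE.match(line.strip())
    return match.groupdict() if match else None


def failed_targets(lines):
    """Collect the targets of block job events whose status is 'failed'."""
    targets = []
    for line in lines:
        event = parse_block_job_event(line)
        if event and event["status"] == "failed":
            targets.append(event["target"])
    return targets


events = [
    "event 'block-job' for domain 'OVMF': Block Copy for /var/lib/libvirt/images/OVMF.qcow2 failed",
    "event 'block-job-2' for domain 'OVMF': Block Copy for vda failed",
]
print(failed_targets(events))
```

For the q3 events this reports both the image path and the disk target vda, since the same job is announced once per event type.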

> q4: The block job bandwidth of non-shared disk migration cannot be set by

> virDomainBlockJobSetSpeed:

> ➜  ~ virsh blockjob OVMF vda --bandwidth 10

> error: Timed out during operation: cannot acquire state change lock (held

> by monitor=remoteDispatchDomainMigratePerform3Params)

This is okay, but we could take it a sa feature request to allow tuning

of the individual blockjobs.
Assuming that tuning the individual blockjobs is supported, it would be hard to tell whether the bandwidth
returned by virDomainMigrateGetMaxSpeed is the speed of the VM migration or the speed of the blockjobs.
In contrast, the bandwidth set by virDomainMigrateSetMaxSpeed is meant to apply to both.

I am not sure whether there is such a use case: the VM migration data is transported via sub-netA while
the block data is transported via sub-netB. That might require setting different bandwidths for the different subnets.
If all the data is transported via the same network interface, we can just keep it as it is now.

BTW, what is the meaning of "sa feature"?

> q5: The block job of non-shared disk migration cannot be aborted by

> virDomainBlockJobAbort:

> ➜  ~ virsh blockjob OVMF vda --abort

> error: Timed out during operation: cannot acquire state change lock (held

> by monitor=remoteDispatchDomainMigratePerform3Params)

This is expected. Same as above, we don't want to allow users to

control this. In contrast to 'q4' I'd refuse an RFE to allow cancelling

of individual jobs.
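
Since q4 and q5 fail with the same error, a test script can tell "blocked by the running migration" apart from other failures by looking at the lock holder named in the message. Below is a small sketch that parses only the error text quoted above. The helper names are hypothetical, and treating any holder whose name contains "Migrate" as a migration job is an assumption of mine, not something libvirt documents.

```python
import re

LOCK_RE = re.compile(
    r"cannot acquire state change lock\s*"
    r"\(held\s+by\s+(?P<kind>\w+)=(?P<holder>\w+)\)"
)


def lock_holder(stderr_text):
    """Return (kind, holder) if the text reports a held state change lock."""
    # The message may be wrapped across lines; collapse all whitespace first.
    normalized = " ".join(stderr_text.split())
    match = LOCK_RE.search(normalized)
    return (match["kind"], match["holder"]) if match else None


def blocked_by_migration(stderr_text):
    """Heuristic (assumption): the holder name mentions 'Migrate'."""
    holder = lock_holder(stderr_text)
    return holder is not None and "Migrate" in holder[1]


error = (
    "error: Timed out during operation: cannot acquire state change lock (held\n"
    "by monitor=remoteDispatchDomainMigratePerform3Params)"
)
print(lock_holder(error))
print(blocked_by_migration(error))
```

For the q4 and q5 errors the holder is remoteDispatchDomainMigratePerform3Params, so the heuristic classifies both as blocked by the migration job.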

> Are the results above expected?

> Here are my personal thoughts:

> The bandwidths in q1 and q2 are both documented as migration bandwidth (

> https://gitlab.com/libvirt/libvirt/-/blob/master/include/libvirt/libvirt-domain.h#L1165

> ,

> https://gitlab.com/libvirt/libvirt/-/blob/master/src/libvirt-domain.c#L9696

> ), but one works for block jobs while the other doesn't. So the comments should

> make clear whether each is the bandwidth of the VM migration only or also

> covers the blockjobs. What's more, we could add a flag to

> virDomainMigrateMaxSpeedFlags to support setting the bandwidth of the

> blockjobs in migration.

> For q4 and q5, if we will not support changing the block jobs of non-shared

> disk migration through the blockjob APIs, we should note that in the migration

> doc or the block job doc, to explain the difference between this type of block

> job and the others.