Re: [PATCH] Migration: Preserve the failed job in case migration job is terminated beyond the perform phase.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Jirka,

On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar@xxxxxxxxxx> wrote:
On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> In case of non-p2p migration, in case libvirt client gets disconnected from source libvirt
> after PERFORM phase is over, the daemon just resets the current migration job.
> However, the VM could be left paused on both source and destination in such case. In case
> the client reconnects and queries migration status, the job has been blanked out from source libvirt,
> and this reconnected client has no clear way of figuring out if an unclean migration had previously
> been aborted.

The virDomainGetState API should return VIR_DOMAIN_PAUSED with
VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?


I understand that a client application should poll source libvirtd for status of migration job completion using virDomainGetJobStats().
However, as you explained above, cleanup callbacks clear the job info so a client should additionally be polling for virDomainGetState() too.
Would it not be cleaner to have a single API reflect the source of truth?
 
> This patch calls out a "potentially" incomplete migration as a failed
> job, so that a client may

As you say it's potentially incomplete, so marking it as failed is not
completely correct. It's a split brain when the source cannot
distinguish whether the migration was successful or not.

Agree, it might have run to completion too, as we observed in some cases. Do you think marking the job status as "UNKNOWN" is better articulation of the current state?
 

> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index e8e0313..7c60d17 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
>      qemuDomainObjSaveJob(driver, obj);
>  }
>
> +
> +void
> +qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> +{
> +    qemuDomainObjPrivatePtr priv = obj->privateData;
> +    VIR_FREE(priv->job.completed);
> +    if (VIR_ALLOC(priv->job.completed) == 0) {
> +        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
> +        priv->job.completed = priv->job.current;

This will just leak the memory allocated for priv->job.completed by
overwriting the pointer to the one from priv->job.current, ...

> +    } else {
> +        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
> +    }
> +    qemuDomainObjResetAsyncJob(priv);

which will point to a freed memory after this call.

Agree, I will fix this.
 

> +    qemuDomainObjEndJob(driver, obj);

And while this is almost certainly (I didn't really check though) not
something you should call from a close callback, you don't save the
changes to the status XML so once libvirtd restarts, it will think the
domain is still being migrated.

I will add the same to status XML. 
I am suggesting that strengthening the job data would be additionally useful. If the daemon has not restarted, job information can still get us fairly accurate status of migration. Pls let me know if you think this is not useful, I will be happy to learn the rationale.

Regards,
Prerna
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list

[Index of Archives]     [Virt Tools]     [Libvirt Users]     [Lib OS Info]     [Fedora Users]     [Fedora Desktop]     [Fedora SELinux]     [Big List of Linux Books]     [Yosemite News]     [KDE Users]     [Fedora Tools]

  Powered by Linux