Re: I/O errors after migration - why?

Takeshi Sone <ts1@xxxxxxxxxx> · Mon, 30 Mar 2009 14:38:07 +0900

Hello,

I had a similar problem with block I/O and migration.
I worked around it by issuing the qemu monitor "stop" command and waiting
1 second before starting the migration (and "cont" after the migration).
See the Ubuntu bug report I posted:
https://bugs.launchpad.net/ubuntu/+source/kvm/+bug/341682
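
For illustration, the workaround sequence can be sketched as below. This is
only a rough sketch, not a tested tool: the helper names, the destination URI
and the `send` callable (anything that delivers one line to a qemu monitor)
are assumptions. Only the "stop", "migrate" and "cont" monitor commands and
the 1 second wait come from the description above.

```python
import time


def migrate_with_pause(send, dest_uri, settle_seconds=1.0, sleep=time.sleep):
    """Pause the guest, wait briefly, then start migration.

    `send` is a caller-supplied function that delivers one command line to
    the source qemu monitor (hypothetical; how it reaches the monitor is up
    to the caller). Returns the list of commands that were sent.
    """
    sent = []

    def issue(cmd):
        send(cmd)
        sent.append(cmd)

    issue("stop")                  # stop the guest CPUs
    sleep(settle_seconds)          # give in-flight block I/O time to finish
    issue("migrate " + dest_uri)   # start migration with the guest paused
    return sent


def resume(send):
    """Send "cont" once migration has finished.

    Which monitor this goes to is an assumption here: presumably the one
    that holds the guest after migration (the destination).
    """
    send("cont")
    return ["cont"]
```

Here `send` might, for example, write to a monitor socket if qemu was started
with something like "-monitor unix:/some/path,server,nowait" (the path is a
placeholder). The fixed wait is a heuristic; nothing in this sketch verifies
that the pending I/O has actually completed.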

I think Nolan's description below explains why stop-and-wait works.

Nolan wrote:
> On Sat, 2009-03-28 at 11:21 +0100, Tomasz Chmielewski wrote:
>> Nolan schrieb:
>>> Tomasz Chmielewski <mangoo <at> wpkg.org> writes:
>>>> I'm trying to perform live migration by following the instructions on 
>>>> http://www.linux-kvm.org/page/Migration.
>>>> Unfortunately, it doesn't work very well - guest is migrated, but loses
>>>> access to its disk.
>>> The LSI logic scsi device model doesn't implement device state save/restore. 
>>> Any suspend/resume, snapshot or migration will fail.
>> Oh, that sucks - as not everything supports virtio (which doesn't work 
>> for me as well for some reason) - like Windows (which should be 
>> addressed soon with block virtio drivers), but also older installations, 
>> running older kernels.
> 
> It is indeed a shame.  I wish I had the time to investigate and resolve
> the problems with my patch that I linked to previously.
> 
> LSI in particular is important for interoperability, as that is what
> VMware uses.
> 
>> Does IDE support migration?
> 
> It appears to, but I am not 100% sure that it will always survive
> migration under heavy IO load.  I've gotten mixed messages on whether or
> not the qemu core waits for all in flight IOs to complete or if the
> device models need to checkpoint pending IOs themselves.  Experimental
> evidence suggests that it does not wait.  Also, from ide.c's checkpoint save
> code:
>     /* XXX: if a transfer is pending, we do not save it yet */
> 
> I think the ideal here would be to stop the CPUs, but let the device
> models continue to run.  Once all pending IOs have completed (and DMAed
> data and/or descriptors into guest memory, or raised interrupts, or
> whatever) then checkpoint all device state.  When the guest resumes, it
> will see an unusual flurry of IO completions and/or interrupts, but it
> should be able to handle that OK.  Shouldn't look much different from
> SMM taking over for a while during high IO load.
> 
> This would save a lot of (unwritten, complex, hard to test)
> checkpointing code in the device models.  Might cause a missed timer
> interrupt or two if there is a lot of slow IO, but that can be
> compensated for if needed.
> 
>>> I sent a patch that partially addresses this (but is buggy in the presence of
>>> in-flight IO):
>>> http://lists.gnu.org/archive/html/qemu-devel/2009-01/msg00744.html
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
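
The ordering Nolan describes above (stop the CPUs, let the device models
finish their pending I/O, and only then checkpoint) can be shown with a toy
model. This is not qemu code; ToyDevice, step, save_state and checkpoint are
hypothetical names made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDevice:
    """Hypothetical device model with a queue of in-flight requests."""
    name: str
    pending: List[str] = field(default_factory=list)
    completed: List[str] = field(default_factory=list)

    def step(self):
        """Complete one in-flight request, if there is one."""
        if self.pending:
            self.completed.append(self.pending.pop(0))

    def save_state(self):
        """Save state; refuses to run while a transfer is still pending."""
        if self.pending:
            raise RuntimeError(self.name + ": transfer still pending")
        return {"name": self.name, "completed": list(self.completed)}


def checkpoint(devices):
    """Drain all pending I/O, then save every device's state.

    The guest CPUs are assumed to be stopped already, so no new requests
    can be queued while the devices drain.
    """
    while any(d.pending for d in devices):
        for d in devices:
            d.step()
    return [d.save_state() for d in devices]
```

With this ordering save_state never sees a pending transfer, so the device
models need no code to serialize in-flight requests. Stopping the guest and
waiting 1 second approximates the same thing, without any guarantee that the
queue has really drained.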
