Re: race condition? virsh migrate --copy-storage-all

Peter Krempa <pkrempa@xxxxxxxxxx> · Tue, 19 Apr 2022 16:07:29 +0200

On Tue, Apr 19, 2022 at 15:51:32 +0200, Valentijn Sessink wrote:
> Hi Peter,
> 
> Thanks.
> 
> On 19-04-2022 13:22, Peter Krempa wrote:
> > It would be helpful if you provide the VM XML file to see how your disks
> > are configured and the debug log file when the bug reproduces:
> 
> I created a random VM to show the effect. XML file attached.
> 
> > Without that my only hunch would be that you ran out of disk space on
> > the destination which caused the I/O error.
> 
> ... it's an LVM2 volume with exact the same size as the source machine, so
> that would be rather odd ;-)

Oh, you are using raw disks backed by block volumes. That was not
obvious before ;)

> 
> I'm guessing that it's this weird message at the destination machine:
> 
> 2022-04-19 13:31:09.394+0000: 1412559: error : virKeepAliveTimerInternal:137
> : internal error: connection closed due to keepalive timeout

That certainly could be a hint ...

> 
> Source machine says:
> 2022-04-19 13:31:09.432+0000: 2641309: debug :
> qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds":
> 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data":
> {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
> 2022-04-19 13:31:09.432+0000: 2641309: debug : virJSONValueFromString:1822 :
> string={"timestamp": {"seconds": 1650375069, "microseconds": 432613},
> "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2",
> "operation": "write", "action": "report"}}

The migration of non-shared storage works as follows:

1) libvirt sets up everything
2) libvirt asks destination qemu to open an NBD server exporting the
   disk backends
3) source libvirt instructs qemu to copy the disks to the NBD server via
   a block-copy job
4) when the block jobs converge, source qemu is instructed to migrate
   memory
5) when memory migrates, source qemu is killed and destination is
   resumed
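
To make the ordering a bit more concrete, here is a toy model of the five
steps above. It is only an illustration, not libvirt code: the names Step,
STEPS and run_until_failure are made up for this mail, and the "actor"
column is my shorthand for the wording in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    actor: str   # component doing the work, paraphrased from the list above
    action: str


# Hypothetical table mirroring the five steps; not a libvirt data structure.
STEPS = [
    Step(1, "libvirt", "set up everything"),
    Step(2, "destination qemu", "open an NBD server exporting the disk backends"),
    Step(3, "source qemu", "copy the disks to the NBD server via a block-copy job"),
    Step(4, "source qemu", "migrate memory once the block jobs converge"),
    Step(5, "source and destination qemu", "kill the source, resume the destination"),
]


def run_until_failure(steps, failing_step):
    """Split the steps into those completed and the one that was aborted,
    assuming the connection breaks while step `failing_step` is running."""
    completed = [s for s in steps if s.number < failing_step]
    aborted = next((s for s in steps if s.number == failing_step), None)
    return completed, aborted


# A break during the storage copy (step 3) means steps 1 and 2 were done
# and nothing after step 3 was ever started.
completed, aborted = run_until_failure(STEPS, 3)
print([s.number for s in completed], aborted.action)
```

The same helper can be called with any other step number, since a broken
connection can abort the migration in any of the phases.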

Now from the keepalive failure on the destination it seems that the
network connection, at least between the migration controller and the
destination libvirt, broke. That might also cause the NBD connection to
break, in which case the block job gets an I/O error.

So the I/O error is actually caused by the network connection and not by
any storage issue.
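
One way to check this is to line up the timestamps from the two logs you
quoted. The snippet below is a rough sketch: libvirt_log_time and
qmp_event_time are helper names I made up, the log-line pattern is inferred
only from the lines in this thread, and it assumes the clocks on both hosts
are reasonably in sync.

```python
import json
import re
from datetime import datetime, timedelta, timezone

# Timestamp prefix as it appears in the quoted libvirt log lines (UTC, "+0000").
LOG_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\+0000")


def libvirt_log_time(line):
    """Parse the leading timestamp of a libvirt log line into a UTC datetime."""
    match = LOG_TS.match(line)
    if match is None:
        raise ValueError("no libvirt-style timestamp at start of line")
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def qmp_event_time(event):
    """Convert the 'timestamp' member of a QMP event into a UTC datetime."""
    ts = event["timestamp"]
    base = datetime.fromtimestamp(ts["seconds"], tz=timezone.utc)
    return base + timedelta(microseconds=ts["microseconds"])


# Destination log line, copied from the quoted mail.
DEST_LINE = (
    "2022-04-19 13:31:09.394+0000: 1412559: error : "
    "virKeepAliveTimerInternal:137 : internal error: "
    "connection closed due to keepalive timeout"
)

# QMP event from the source log, copied from the quoted mail.
EVENT = json.loads(
    '{"timestamp": {"seconds": 1650375069, "microseconds": 432613}, '
    '"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", '
    '"operation": "write", "action": "report"}}'
)

# Positive delta: the block job error was logged after the keepalive timeout.
delta = qmp_event_time(EVENT) - libvirt_log_time(DEST_LINE)
print(EVENT["event"], EVENT["data"]["device"], delta.total_seconds())
```

With the lines from this thread the block job error on the source shows up
a few tens of milliseconds after the keepalive timeout on the destination,
which fits the network explanation but does not prove it on its own.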

So at this point I suspect that something with the network broke and
the migration was aborted in the storage copy phase, but it could have
been in any other.