live migration vs device assignment (was Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC)

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Mon, 7 Dec 2015 18:50:39 +0200

On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.

TLDR:
- split this into 3 unrelated efforts/patchsets
- try implementing this host-side only using VT-d dirty tracking
- if making guest changes, make them in a way that makes many devices benefit
- measure speed before trying to improve it

-------

First, this does not help to actually do migration with an
active assigned device. Guest needs to deactivate the device
before VM is moved around.

What they are actually able to do, instead, is three things.
My suggestion is to split them up, and work on them
separately.  There's really no need to have them all.

I discuss all 3 things below, but if we do need to have some discussion,
please snip and  let's have separate threads for each item please.

1. Starting live migration with device running.
This might help speed up networking during pre-copy where there is a
long warm-up phase.

Note: To complete migration, one also has to do something to stop
the device, but that's a separate item, since existing hot-unplug
request will do that just as well.

Proposed changes of approach:
One option is to write into the dma memory to make it dirty.  Your
patches do this within the driver, but doing this in the generic dma
unmap code seems more elegant as it will help all devices.  An
interesting note: on unplug, driver unmaps all memory for DMA, so this
works out fine.

Some benchmarking will be needed to show the performance overhead.
It is likely non zero, so an interface would be needed
to enable this tracking before starting migration.

According to the VT-d spec, I note that bit 6 in the PTE is the dirty
bit.  Why don't we use this to detect memory changes by the device?
Specifically, periodically scan pages that we have already
sent, test and clear atomically the dirty bit in the PTE of
the IOMMU, and if set, resend the page.
The interface could be simply an ioctl for VFIO giving
it a range of memory, and have VFIO do the scan and set
bits for userspace.

This might be slower than writing into DMA page,
since e.g. PML does not work here.

We could go for a mixed approach, where we negotiate with the
guest: if guest can write into memory on unmap, then
skip the scanning, otherwise do scanning of IOMMU PTEs
as described above.

I would suggest starting with clean IOMMU PTE polling
on host. If you see that there is a performance problem,
optimize later by enabling the updates within guest
if required.

2.  (Presumably) faster device stop.
After the warmup phase, we need to enter the stop and
copy phase. At that point, device needs to be stopped.
One way to do this is to send request to guest while
we continue to track and send memory changes.
I am not sure whether this is what you are doing,
but I'm assuming it is.

I don't know what do you do on the host,
I guesss you could send removal request to guest, and
keep sending page updates meanwhile.
After guest eject/stop acknowledge is received on the host,
you can enter stop and copy.

Your patches seem to stop device with a custom device specific
register, but using more generic interfaces, such as
e.g. device removal, could also work, even if
it's less optimal.

The way you defined the interfaces, they don't
seem device specific at all.
A new PCI capability ID reserved by the PCI SIG
could be one way to add the new interface
if it's needed.

We also need a way to know what does guest support.
With hotplug we know all modern guests support
it, but with any custom code we need negotiation,
and then fall back on either hot unplug
or blocking migration.

Additionally, hot-unplug will unmap all dma
memory so if all dma unmap callbacks do
a write, you get that memory dirtied for free.

At the moment, device removal destroys state such as IP address and arp
cache, but we could have guest move these around
if necessary. Possibly this can be done in userspace with
the guest agent. We could discuss guest kernel or firmware solutions
if we need to address corner cases such as network boot.

You might run into hotplug behaviour such as
a 5 second timeout until device is actually
detected. It always seemed silly to me.
A simple if (!kvm) in that code might be justified.

The fact that guest cooperation is needed
to complete migration is a big problem IMHO.
This practically means you need to give a lot of
CPU to a guest on an overcommitted host
in order to be able to move it out to another host.
Meanwhile, guest can abuse the extra CPU it got.

Can not surprise removal be emulated instead?
Remove device from guest control by unmapping
it from guest PTEs, teach guest not to crash
and not to hang. Ideally reset the device instead.

This sounds like a useful thing to support even
outside virtualization context.

3.  (Presumably) faster device start
Finally, device needs to be started at destination.  Again, hotplug
will also work. Isn't it fast enough? where exactly is
the time spent?

Alternatively, some kind of hot-unplug that makes e.g.
net core save the device state so the following hotplug can restore it
back might work. This is closer to what you are trying to
do, but it is not very robust since device at source
and destination could be slightly different.

A full reset at destination sounds better.

If combining with surprise removal in 2 above, maybe pretend to linux
that there was a fatal error on the device, and have linux re-initialize
it?  To me this sounds better as it will survive across minor device
changes src/dst. Still won't survive if driver happens to change which
isn't something users always have control over.
We could teach linux about a new event that replaces
the device.

Again, might be a useful thing to support even
outside virtualization context.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html