Re: [PATCH 0/5] QEMU VFIO live migration

Zhao Yan <yan.y.zhao@xxxxxxxxx> · Thu, 28 Mar 2019 04:36:03 -0400

hi Alex and Dave,
Thanks for your replies.
Please see my comments inline.

On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> On Wed, 27 Mar 2019 20:18:54 +0000
> "Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx> wrote:
> 
> > * Zhao Yan (yan.y.zhao@xxxxxxxxx) wrote:
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > a device that has less device memory ?    
> > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > along with verion ?).    
> > > > > 
> > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > providers. But we still think about the live migration between from the the lower
> > > > > generation of hardware migrated to the higher generation.  
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?  
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identify live migration interface's version. it can have a
> > > sort of backward compatibility, like target machine's version >= source
> > > machine's version. something like that.
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
yes, good idea!
so, what about declaring below new cap? 
    #define VFIO_REGION_INFO_CAP_MIGRATION 4
    struct vfio_region_info_cap_migration {
        struct vfio_info_cap_header header;
        __u32 device_version_len;
        __u8  device_version[];
    };

> > > 
> > > device_version field consists two parts:
> > > 1. vendor id : it takes 32 bits. e.g. 0x8086.
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> 
> For vendor specific regions.
Yes, use PCI vendor ID.
you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
to identify it's a PCI ID.
Thanks for pointing it out. 
But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
uses it.

> > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > thinks can identify a source device. e.g. pciid + mdev type.
> > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > 
> > > 
> > > struct vfio_device_state_ctl {
> > >      __u32 version;            /* ro */
> > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > >      struct {
> > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > 	...
> > >      }data;
> > >      ...
> > >  };
> 
> We have a buffer area where we can read and write data from the vendor
> driver, why would we have this IS_COMPATIBLE protocol to write the
> device version string but use a static fixed length version string in
> the control header to read it?  IOW, let's use GET_VERSION,
> CHECK_VERSION actions that make use of the buffer area and allow vendor
> specific version information length.
you are right, such static fixed length version string is bad :)
To get device version, do you think which approach below is better?
1. use GET_VERSION action, and read from region buffer
2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION

seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
for checking migration interface's version?

> > > 
> > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > 
> > > The flow to figure out whether a source device is migratable to target device
> > > is like that:
> > > 1. in source side's .save_setup, save source device's device_version string
> > > 2. in target side's .load_state, load source device's device version string
> > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > to check whether the source device is compatible to it.
> > > 
> > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > maintain a compatibility table and decide whether source device is compatible
> > > to target device according to its proprietary table.
> > > In device_version string, vendor driver only has to describe the source
> > > device as elaborately as possible and resorts to vendor driver in target side
> > > to figure out whether they are compatible.  
> 
> I agree, it's too complicated and restrictive to try to create an
> interface for the user to determine compatibility, let the driver
> declare it compatible or not.
:)

> > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > externally - so we could be able to answer a question like 'can we
> > migrate this VM to this host' - from the management layer before it
> > actually starts the migration.

so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
GET_VERSION returns a vm's device's version string.
CHECK_VERSION's input is device version string and return
compatible/non-compatible.
Do you think it's good?

> I think we'd need to mirror this capability in sysfs to support that,
> or create a qmp interface through QEMU that the device owner could make
> the request on behalf of the management layer.  Getting access to the
> vfio device requires an iommu context that's already in use by the
> device owner, we have no intention of supporting a model that allows
> independent tasks access to a device.  Thanks,
> Alex
>
do you think two sysfs nodes under a device node is ok?
e.g.
/sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
/sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version

Thanks
Yan