Re: device compatibility interface for live migration with assigned devices

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Wed, 26 Aug 2020 16:54:11 +0800

On Thu, Aug 20, 2020 at 02:24:26PM +0100, Sean Mooney wrote:
> On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote:
> > On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
> > > On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
> > > > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> > > > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > > > > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > > > > > Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > > > > > >    On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > > > > > > > 
> > > > > > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > > > > > 
> > > > > > > > >  On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > > > > > > > 
> > > > > > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > > > > > 
> > > > > > > > >  On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > > > > > > >  we actually can also retrieve the same information through sysfs, e.g.
> > > > > > > > > 
> > > > > > > > >  |- [path to device]
> > > > > > > > >     |--- migration
> > > > > > > > >     |     |--- self
> > > > > > > > >     |     |   |---device_api
> > > > > > > > >     |     |   |---mdev_type
> > > > > > > > >     |     |   |---software_version
> > > > > > > > >     |     |   |---device_id
> > > > > > > > >     |     |   |---aggregator
> > > > > > > > >     |     |--- compatible
> > > > > > > > >     |     |   |---device_api
> > > > > > > > >     |     |   |---mdev_type
> > > > > > > > >     |     |   |---software_version
> > > > > > > > >     |     |   |---device_id
> > > > > > > > >     |     |   |---aggregator
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >  Yes but:
> > > > > > > > > 
> > > > > > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > > > > > >  - Attribute is coupled with kobject
> > > > > > > 
> > > > > > > Is that really that bad? You have the device with an embedded kobject
> > > > > > > anyway, and you can just put things into an attribute group?
> > > > > > > 
> > > > > > > [Also, I think that self/compatible split in the example makes things
> > > > > > > needlessly complex. Shouldn't semantic versioning and matching already
> > > > > > > cover nearly everything? I would expect very few cases that are more
> > > > > > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > > > > > need that self/compatible split for that, either.]
> > > > > > 
> > > > > > Hi Cornelia,
> > > > > > 
> > > > > > The reason I want to declare compatible list of attributes is that
> > > > > > sometimes it's not a simple 1:1 matching of source attributes and target attributes
> > > > > > as I demonstrated below,
> > > > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> > > > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> > > > > >                (mdev_type i915-GVTg_V5_8 + aggregator 4)
> > > > > 
> > > > > the way you are doing the naming is still really confusing, by the way.
> > > > > if this has not already been merged in the kernel, can you change the mdev
> > > > > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead of half the device?
> > > > > 
> > > > > currently you need to divide the aggregator by the number at the end of the mdev type to figure out
> > > > > how much of the physical device is being used, which is a very unfriendly api convention.
> > > > > 
> > > > > the way aggregators are being proposed in general is not really something i like, but i think this at least
> > > > > is something that should be possible to correct.
> > > > > 
> > > > > with the complexity in the mdev type name + aggregator i suspect that this will never be supported
> > > > > in openstack nova directly, requiring integration via cyborg unless we can pre-partition the
> > > > > device into mdevs statically and just ignore this.
> > > > > 
> > > > > this is way too vendor specific to integrate into something like openstack in nova unless we can guarantee
> > > > > that how aggregators work will be portable across vendors generically.
> > > > > 
> > > > > > 
> > > > > > and the aggregator may be just one example where 1:1 matching does not
> > > > > > fit.
> > > > > 
> > > > > for openstack nova i don't see us supporting anything beyond the 1:1 case where the mdev type does not change.
> > > > > 
> > > > 
> > > > hi Sean,
> > > > I understand it's hard for openstack. but 1:N is always meaningful.
> > > > e.g.
> > > > if source device 1 has cap A, it is compatible to
> > > > device 2: cap A,
> > > > device 3: cap A+B,
> > > > device 4: cap A+B+C
> > > > ....
> > > > to allow openstack to detect it correctly, in compatible list of
> > > > device 2, we would say compatible cap is A;
> > > > device 3, compatible cap is A or A+B;
> > > > device 4, compatible cap is A or A+B, or A+B+C;
> > > > 
> > > > then if openstack finds device 1's self cap A is contained in the compatible
> > > > cap of device 2/3/4, it can migrate device 1 to device 2,3,4.
> > > > 
> > > > conversely,  device 1's compatible cap is only A,
> > > > so it is able to migrate device 2 to device 1, and it is not able to
> > > > migrate device 3/4 to device 1.
> > > 
> > > yes, we built the placement service around the idea of capabilities as traits on resource providers,
> > > which is why i originally asked if we could model compatibility with feature flags.
> > > 
> > > we can easily model devices as supporting A, A+B or A+B+C
> > > and then select hosts and devices based on that, but
> > > 
> > > the list of compatible devices you are proposing hides this feature information, which would be what we are
> > > matching on.
> > > 
> > > giving me a set of features you want and listing the features available on each device allows higher level
> > > orchestration to easily match the request to a host that can fulfil it, but having a set of other compatible
> > > devices does not help with that.
> > > 
> > > so if a simple list of capabilities can be advertised and if we know that two devices with the same capability
> > > are interchangeable, that is workable. i suspect that will not be the case however, and it would only work within
> > > a family of mdevs that are closely related, which i think again is an argument for not changing the mdev type
> > > and at least initially only looking at migration where the mdev type does not change.
> > > 
> > 
> > sorry Sean, I don't understand your words completely.
> > Please allow me to write it down in my words, and please confirm if my
> > understanding is right.
> > 1. you mean you agree on that each field is regarded as a trait, and
> > openstack can compare by itself if source trait is a subset of target trait, right?
> > e.g.
> > source device
> > field1=A1
> > field2=A2+B2
> > field3=A3
> > 
> > target device
> > field1=A1+B1
> > field2=A2+B2
> > field3=A3
> > 
> > then openstack sees that field1/2/3 in source is a subset of field1/2/3 in
> > target, so it's migratable to target?
> 
> yes this is basically how cpu features work.
> if we see the host cpu on the dest is a superset of the cpu features used
> by the vm we know it's safe to migrate.

got it. glad to know it :)
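
to make sure we mean the same check, here is a minimal sketch in Python of the
per-field subset comparison from point 1. it assumes each field is a set of
trait strings; the function name is_migratable and the dict layout are made up
for illustration and are not an existing openstack or kernel interface.

```python
def is_migratable(src, dst):
    """Return True if every field of src is a subset of the same field in dst.

    src and dst map a field name to a set of trait strings (hypothetical layout).
    """
    for name, src_traits in src.items():
        # a field missing on the target is treated as an empty trait set
        if not src_traits <= dst.get(name, set()):
            return False
    return True


# values taken from the source/target example in point 1
source = {"field1": {"A1"}, "field2": {"A2", "B2"}, "field3": {"A3"}}
target = {"field1": {"A1", "B1"}, "field2": {"A2", "B2"}, "field3": {"A3"}}

print(is_migratable(source, target))  # source -> target
print(is_migratable(target, source))  # target -> source
```

with these values, source -> target passes and target -> source does not,
because field1 on the target carries B1, which the source lacks.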
> > 
> > 2. mdev_type + aggregator make it hard to achieve the above elegant
> > solution, so it's best to avoid the combined comparing of mdev_type + aggregator.
> > do I understand it correctly?
> yes and no. one of the challenges that mdevs pose right now is that sometimes mdevs model
> independent resources and sometimes multiple mdev types consume the same underlying resources.
> there is no way for openstack to know if i915-GVTg_V5_2 and i915-GVTg_V5_4 consume the same resources
> or not. as such we can't do the accounting properly, so i would much prefer to have just 1 mdev type,
> i915-GVTg, which models the minimal allocatable unit, and then say i want 4 of them composed into 1 device
> rather than have a second mdev type that does that, since
> 
> what that means in practice is we cannot trust the available_instances for a given mdev type,
> as consuming a different mdev type might change it. aggregators make that problem worse.
> which is why i said i would prefer if, instead of the aggregator as proposed, each consumable
> resource was reported independently as a different mdev type and then we composed those
> like we would when bonding ports: creating an attachment or other logical aggregation that refers
> to instances of mdevs of differing types, which we expose as a single mdev to the guest.
> in a concrete example we might say create an aggregate of 64 cuda cores and 32 tensor cores and "bond them"
> or aggregate them as a single attached mdev and provide that to an ml workload guest. a different guest could request
> 1 instance of the nvenc video encoder and one instance of the nvenc video decoder but no cuda or tensor for a video
> transcoding workload.
> 
The "bond" you described is a little different from the intention of the
aggregator we introduced for scalable IOV (as explained in another mail
to Cornelia, https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg06523.html).

But anyway, we agree that mdevs are not compatible if mdev_types are not compatible.

> if each of those components is an independent mdev type and can be composed with that granularity then i think that approach
> is better than the current aggregator with vendor specific fields.
> we can model the physical device as being multiple nested resources with different traits for each type of resource and
> different capacities for the same. we can even model how many of the attachments/compositions can be done independently
> if there is a limit on that.
> 
> |- [parent physical device]
> |--- Vendor-specific-attributes [optional]
> |--- [mdev_supported_types]
> |     |--- [<type-id>]
> |     |   |--- create
> |     |   |--- name
> |     |   |--- available_instances
> |     |   |--- device_api
> |     |   |--- description
> |     |   |--- [devices]
> |     |--- [<type-id>]
> |     |   |--- create
> |     |   |--- name
> |     |   |--- available_instances
> |     |   |--- device_api
> |     |   |--- description
> |     |   |--- [devices]
> |     |--- [<type-id>]
> |          |--- create
> |          |--- name
> |          |--- available_instances
> |          |--- device_api
> |          |--- description
> |          |--- [devices]
> 
> a benefit of this approach is that the mdev types would not change on migration
> and we could just compare a simple version string and feature flag list to determine compatibility
> in a vendor neutral way. i don't necessarily need to know what the vendor flags mean, just that the dest is a subset of
> the source and that the semantic version numbers say the mdevs are compatible.
> > 
as the aggregator and some other attributes are only meaningful after
devices are created, and vendors' naming of mdev types is not unified,
do you think the layout below is good?

|- [parent physical device]
|--- [mdev_supported_types]
|     |--- [<type-id>]
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- compatible_type [must]
|     |   |--- Vendor-specific-compatible-type-attributes [optional]
|     |   |--- device_api [must]
|     |   |--- software_version [must]
|     |   |--- description
|     |   |--- [devices]
|     |   |--------[<uuid>]
|     |   |            |--- vendor-specific-compatible-device-attributes [optional]

all vendor specific compatible attributes have names that begin with "compatible".

in GVT's current case,
|- 0000\:00\:02.0
|--- mdev_supported_types
|     |--- i915-GVTg_V5_8
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- compatible_type : i915-GVTg_V5_8, i915-GVTg_V4_8
|     |   |--- device_api : vfio-pci
|     |   |--- software_version : 1.0.0
|     |   |--- compatible_pci_ids : 5931, 591b
|     |   |--- description
|     |   |--- devices
|     |   |       |- 882cc4da-dede-11e7-9180-078a62063ab1
|     |   |       |     | --- aggregator : 1
|     |   |       |     | --- compatible_aggregator : 1
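
for illustration, a management tool could collect the compatible attributes of
such a directory roughly like this. this is a minimal sketch in Python, assuming
each attribute is a small text file holding a comma-separated list, as the
" : " notation in the tree suggests. these attributes are only proposed here,
and read_compatible_attrs is a hypothetical helper.

```python
import os


def read_compatible_attrs(directory):
    """Collect attributes whose file name starts with "compatible".

    Returns {attribute name: set of values}. The comma-separated file
    format is an assumption of this sketch, not an existing sysfs ABI.
    """
    attrs = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # skip other attributes (create, name, ...) and subdirectories
        if not name.startswith("compatible") or not os.path.isfile(path):
            continue
        with open(path) as f:
            values = f.read().split(",")
        attrs[name] = {v.strip() for v in values if v.strip()}
    return attrs
```

called on a directory laid out like the i915-GVTg_V5_8 type above, it would
return the compatible_type and compatible_pci_ids entries as sets.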

suppose 882cc4da-dede-11e7-9180-078a62063ab1 is a src mdev.
the sequence for openstack to find a compatible mdev, in my mind, is:
1. treat the src mdev type and compatible_type as traits.

2. look for a mdev type that is either i915-GVTg_V4_8 or i915-GVTg_V5_8,
as listed in compatible_type.
(this is just an example; currently we only support migration between
mdevs whose attributes all match, from mdev type to aggregator
to pci_ids)

3. if 2 fails, try to find a mdev type whose compatible_type is a
superset of the src compatible_type. if one is found, go to step 4; otherwise,
quit.

4. check if device_api, software_version under the type are compatible.

5. check if other vendor specific type attributes under the type are compatible.
- check if src compatible_pci_ids is a subset of target compatible_pci_ids.

6. check if a device is already created and not occupied; if not, create one.

7. check if vendor specific attributes under the device are compatible.
- check if src compatible_aggregator is a subset of target compatible_aggregator.
  if that fails, try to find the counterpart attribute of the vendor specific device attribute
  and set the target value according to compatible_xxx on the source side.
  (for compatible_aggregator, its counterpart is aggregator.)
  if attribute aggregator exists, step 7 succeeds when setting its value succeeds.
  if attribute aggregator does not exist, step 7 fails.

8. a compatible target is found.
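
the steps above can be sketched roughly as follows. this is only an
illustration in Python under several assumptions: attributes are already
parsed into sets, software_version is treated as compatible when the major
number matches (the real rule is not decided yet), step 6 is left out so
targets must already be free devices, and when the aggregator has to be set,
the smallest value accepted by the source is picked. all class and function
names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MdevType:
    name: str
    compatible_type: set
    device_api: str
    software_version: str
    compatible_pci_ids: set


@dataclass
class Mdev:
    type: MdevType
    aggregator: Optional[int]  # None means the attribute does not exist
    compatible_aggregator: set


def type_ok(src, dst):
    # step 4: device_api must match and software_version must share the
    # major number (an assumption of this sketch)
    if src.device_api != dst.device_api:
        return False
    if src.software_version.split(".")[0] != dst.software_version.split(".")[0]:
        return False
    # step 5: vendor specific type attributes
    return src.compatible_pci_ids <= dst.compatible_pci_ids


def device_ok(src, dst):
    # step 7: subset check first, then fall back to setting the counterpart
    if src.compatible_aggregator <= dst.compatible_aggregator:
        return True
    if dst.aggregator is None:
        return False
    dst.aggregator = min(src.compatible_aggregator)  # arbitrary choice
    return True


def find_compatible_target(src, targets):
    # steps 1-2: target types named in the source compatible_type
    picked = [t for t in targets if t.type.name in src.type.compatible_type]
    if not picked:
        # step 3: types whose compatible_type is a superset of the source's
        picked = [t for t in targets
                  if t.type.compatible_type >= src.type.compatible_type]
    for dst in picked:
        if type_ok(src.type, dst.type) and device_ok(src, dst):
            return dst  # step 8
    return None
```

in a real tool, setting the aggregator in step 7 would be a write to the
target device that can fail; the sketch only records the chosen value.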

not sure if the above steps look good to you.

some changes are required to the compatibility check for physical devices when mdev_type is absent,
but let's arrive at consensus for mdevs first :)

> > 3. you don't like self list and compatible list, because it is hard for
> > openstack to compare different traits?
> > e.g. if we have self list and compatible list, then as below, openstack needs
> > to compare if self field1/2/3 is a subset of compatible field 1/2/3.
> currently we only use mdevs for vGPUs and in our documentation we tell customers
> to model the mdev_type as a trait and request it as a required trait.
> so for customers that are doing that today changing mdev types is not really an option.
> we would prefer that they request the features they need instead of a specific mdev type
> so we can select any that meets their needs.
> for example we have a bunch of traits for cuda support
> https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/cuda.py
> or directx/vulkan/opengl https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/api.py
> these are closely analogous to cpu feature flags like avx or sse
> https://github.com/openstack/os-traits/blob/master/os_traits/hw/cpu/x86/__init__.py#L16
> 
> so when it comes to compatibilities it would be ideal if you could express capabilities as something like
> a cpu feature flag, then we can easily model those as traits.
> > 
> > source device:
> > self field1=A1
> > self field2=A2+B2
> > self field3=A3
> > 
> > compatible field1=A1
> > compatible field2=A2;B2;A2+B2;
> > compatible field3=A3
> > 
> > 
> > target device:
> > self field1=A1+B1
> > self field2=A2+B2
> > self field3=A3
> > 
> > compatible field1=A1;B1;A1+B1;
> > compatible field2=A2;B2;A2+B2;
> > compatible field3=A3
> > 
> > 
> > Thanks
> > Yan
> > 
> > 
> > > > 
> > > > 
> > > > > i would really prefer if there was just one mdev type that represented the minimal allocatable unit and the
> > > > > aggregator was used to create compositions of that. i.e. instead of i915-GVTg_V5_2 being half the device,
> > > > > have 1 mdev type i915-GVTg and if the device supports 8 of them then we can aggregate 4 of i915-GVTg
> > > > > 
> > > > > if you want to have multiple mdev types to model the different amounts of the resource, e.g. i915-GVTg_small
> > > > > and i915-GVTg_large, that is totally fine too, or even i915-GVTg_4 indicating it is 4 of i915-GVTg
> > > > > 
> > > > > failing that i would just expose an mdev type per composable resource and allow us to compose them at the
> > > > > user level with some other construct modeling an attachment to the device. e.g. create a composed mdev or
> > > > > something that is an aggregation of multiple sub resources, each of which is an mdev. so kind of like how
> > > > > bond ports work. we would create an mdev for each of the sub resources and then create a bond or aggregated
> > > > > mdev by referencing the other mdevs by uuid, then attach only the aggregated mdev to the instance.
> > > > > 
> > > > > the current aggregator syntax and semantics however make me rather uncomfortable when i think about
> > > > > orchestrating vms on top of it, even to boot them let alone migrate them.
> > > > > > 
> > > > > > So, we explicitly list out self/compatible attributes, and management
> > > > > > tools only need to check if the self attributes are contained in the compatible
> > > > > > attributes.
> > > > > > 
> > > > > > or do you mean only the compatible list is enough, and the management tools
> > > > > > need to find out the self list by themselves?
> > > > > > But I think providing a self list is easier for management tools.
> > > > > > 
> > > > > > Thanks
> > > > > > Yan
> > > > > > 
> > > > 
> > > > 
> > 
> > 
>