> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Thursday, July 9, 2020 2:48 AM
>
> On Wed, 8 Jul 2020 06:31:00 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
> > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Sent: Wednesday, July 8, 2020 9:07 AM
> > >
> > > On Tue, 7 Jul 2020 23:28:39 +0000
> > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > >
> > > > Hi, Alex,
> > > >
> > > > Gentle ping... Please let us know whether this version looks good.
> > >
> > > I figured this is entangled with the versioning scheme. There are
> > > unanswered questions about how something that assumes a device of a
> > > given type is software compatible to another device of the same type
> > > handles aggregation and how the type class would indicate compatibility
> > > with an aggregated instance. Thanks,
> > >
> >
> > Yes, this open is an interesting topic. I didn't closely follow the versioning
> > scheme discussion. Below is some preliminary thought in my mind:
> >
> > --
> > First, let's consider migrating an aggregated instance:
> >
> > A conservative policy is to check whether the compatible type is supported
> > on target device and whether available instances under that type can afford
> > the ask of the aggregated instance. Compatibility check in this scheme is
> > separated from aggregation check, then no change is required to the current
> > versioning interface.
>
> How many features, across how many attributes is an administrative tool
> supposed to check for compatibility? ie. if we add an 'aggregation'
> feature now and 'translucency' feature next year, with new sysfs
> attributes and creation options, won't that break this scheme? I'm not
> willing to assume aggregation is the sole new feature we will ever add,
> therefore we don't get to make it a special case without a plan for how
> the next special case will be integrated.

Got you.
I thought aggregation is special since it is purely about linear resource
adjustment w/o changing the feature set of the instance, thus reasonable
to get special handling in the management stack, which needs to understand
this attribute anyway. But I agree that it's difficult to predict the future
and other special cases...

>
> We also can't even seem to agree that type is a necessary requirement
> for compatibility. Your discussion below of a type-A, which is
> equivalent to a type-B w/ aggregation set to some value is an example
> of this. We might also have physical devices with extensions to
> support migration. These could possibly be compatible with full mdev
> devices. We have no idea how an administrative tool would discover
> this other than an exhaustive search across every possible target.
> That's ugly but feasible when considering a single target host, but
> completely untenable when considering a datacenter.

If the exhaustive search can be done just once, to build a compatibility
database for all assignable devices on each node, then might it still be
tenable in a datacenter?

> >
> > Then there comes a case where the target device doesn't handle aggregation
> > but support a different type which however provides compatible capabilities
> > and same resource size as the aggregated instance expects. I guess this is
> > one puzzle how to check compatibility between such types. One possible
> > extension is to introduce a non_aggregated_list to indicate compatible
> > non-aggregated types for each aggregated instance. Then mgmt.. stack
> > just loop the compatible list if the conservative policy fails. I didn't think
> > carefully about what format is reasonable here. But if we agree that an
> > separate interface is required to support such usage, then this may come
> > later after the basic migration_version interface is completed.
>
> ...and then a non_translucency_list and then a non_brilliance_list and
> then a non_whatever_list... no.
> Additionally it's been shown difficult
> to predict the future, if a new device is developed to be compatible
> with an existing device it would require updates to the existing device
> to learn about that compatibility.

I suppose a compatibility list like this doesn't require the existing device
to be updated. It should be the new device's responsibility to claim
compatibility with the types carried in the existing list.

>
> > --
> >
> > Another scenario is about migrating a non-aggregated instance to a device
> > handling aggregation. Then there is an open whether an aggregated type
> > can be used to back the non-aggregated instance in case of no available
> > instance under the original type claimed by non-aggregated instance.
> > This won't happen in KVMGT, because all vGPU types share the same
> > resource pool. Allocating instance under one type also decrement available
> > instances under other types. So if we fail to find available instance under
> > type-A (with 4x resource of type-B), then we will also fail to create an
> > aggregated instance (aggregate=4) under type-B. therefore, we just
> > need stick to basic type compatibility check for non-aggregated instance.
> > And I feel this assumption can be applied to other devices handling
> > aggregation. It doesn't make sense for two types to claim compatibility
> > (only with resource size difference) when their resources are allocated
> > from different pools (which usually implies different capability or QOS/
> > SLA difference). With this assumption, we don't need provide another
> > interface to indicate compatible aggregated types for non-aggregated
> > interface.
> > --
> >
> > I may definitely overlook something here, but if above analysis sounds
> > reasonable, then this series could be decoupled from the versioning
> > scheme discussion based on conservative policy for now. :)
>
> The only potential I see for decoupling the discussions would be to do
> aggregation via a vendor attribute.
> Those already provide a mechanism
> to manipulate a device after creation and something that we'll already
> need to solve in determining migration compatibility. So in that
> sense, it seems like it at least doesn't make the problem worse.
> Thanks,
>

This makes some sense, since 'aggregation' still changes how the instance
looks anyway. But let me make sure I understand clearly. Are you proposing
actually moving 'aggregation' to be a vendor attribute (i.e. removing the
'mdev' sub-directory in this patch), or is it more about a policy of
treating it as a vendor attribute? If the former, is there any problem with
having Libvirt manage this attribute, given that it becomes vendor specific
now?

Thanks
Kevin
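
For illustration only, the shared-pool argument in the quoted discussion (type-A with 4x the resource of type-B, all types drawing from one pool) can be sketched as below. This is a toy model under stated assumptions, not KVMGT or mdev code: `SharedPool`, `conservative_check`, and the unit counts are hypothetical names and numbers made up for the example.

```python
from dataclasses import dataclass


@dataclass
class SharedPool:
    """Hypothetical device whose types all draw from one resource pool."""
    free_units: int
    type_units: dict  # units used by one instance of each type (assumed values)

    def available_instances(self, type_name: str, aggregate: int = 1) -> int:
        # How many instances of this type, at this aggregation, still fit.
        return self.free_units // (self.type_units[type_name] * aggregate)

    def create(self, type_name: str, aggregate: int = 1) -> bool:
        cost = self.type_units[type_name] * aggregate
        if cost > self.free_units:
            return False
        # Allocating under one type also reduces what is left for the others.
        self.free_units -= cost
        return True


def conservative_check(target: SharedPool, type_name: str, aggregate: int) -> bool:
    """Type support and resource availability are checked as separate steps."""
    if type_name not in target.type_units:
        return False
    return target.available_instances(type_name, aggregate) >= 1


pool = SharedPool(free_units=6, type_units={"type-A": 4, "type-B": 1})
assert pool.create("type-A")         # 2 units left
assert not pool.create("type-A")     # no room for a second type-A ...
assert not pool.create("type-B", 4)  # ... so type-B with aggregate=4 fails too
assert conservative_check(pool, "type-B", 2)
```

In this model a failed type-A allocation always implies a failed type-B allocation with aggregate=4, because both cost the same number of units from the same pool. The model says nothing about devices whose types are backed by different pools.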