On Wed, 1 Aug 2018 13:12:05 +0200 Halil Pasic <pasic@xxxxxxxxxxxxx> wrote: > On 07/27/2018 07:44 PM, Alex Williamson wrote: > > On Thu, 26 Jul 2018 21:54:29 +0200 > > Christian Borntraeger <borntraeger@xxxxxxxxxx> wrote: > > ... > >> +The process for reserving an AP queue for use by a KVM guest is: > >> + > >> +* The vfio-ap driver during its initialization will perform the following: > >> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap > >> + * Create the 'matrix' device in the 'vfio_ap' root > >> + * Register the matrix device with the device core > >> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and > >> + newer) and to provide the vfio_ap driver's probe and remove callback > >> + interfaces. The reason why older devices are not supported is because there > >> + are no systems available on which to test. > >> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be > >> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain > >> + mask to /sys/bus/ap/aqmask. > >> + > >> + For example to reserve all the AP Queues on the system for vfio_ap driver: > >> + > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask > > > > And this is a reasonable user syntax? ;) > > > > We intend to provide tooling that makes this more human friendly. BTW > echo 0 > sys/bus/ap/apmask would also work -- and looks less scary. If tooling is required, perhaps the interface is wrong by design. Maybe the interface should accept a list of bits, for example: echo 255,0 > /sys/bus/ap/aqmask vs echo 0x8000000000000000000000000000000000000000000000000000000000000001 > /sys/bus/ap/aqmask > > ... > >> + * mdev_attr_groups > >> + This attribute group identifies the user-defined sysfs attributes of the > >> + mediated device. When a device is registered with the VFIO mediated device > >> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' > >> + structure will be created in the mediated matrix device's directory. The > >> + sysfs attributes for a mediated matrix device are: > >> + * assign_adapter: > >> + A write-only file for assigning an AP adapter to the mediated matrix > >> + device. To assign an adapter, the APID of the adapter is written to the > >> + file. > >> + * assign_domain: > >> + A write-only file for assigning an AP usage domain to the mediated matrix > >> + device. To assign a domain, the APQI of the AP queue corresponding to a > >> + usage domain is written to the file. > >> + * matrix: > >> + A read-only file for displaying the APQNs derived from the adapters and > >> + domains assigned to the mediated matrix device. > >> + * assign_control_domain: > >> + A write-only file for assigning an AP control domain to the mediated > >> + matrix device. To assign a control domain, the ID of a domain to be > >> + controlled is written to the file. For the initial implementation, the set > >> + of control domains will always include the set of usage domains, so it is > >> + only necessary to assign control domains that are not also assigned as > >> + usage domains. > > > > > > How will the user know when this changes? What's the transition plan > > to maintain compatibility when it does change? > > > > I don't think this is supposed to change under the users feet. This document > probably started out as a design document. While I'm personally in favor of > managing the two separate, I think we need to commit to either-or and > get rid of 'For the initial implementation,'. +1 > > ... > >> +4. The administrator now needs to configure the matrixes for mediated > >> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). > >> + > >> + This is how the matrix is configured for Guest1: > >> + > >> + echo 5 > assign_adapter > >> + echo 6 > assign_adapter > >> + echo 4 > assign_domain > >> + echo 0xab > assign_domain > >> + > >> + For this implementation, all usage domains - i.e., domains assigned > >> + via the assign_domain attribute file - will also be configured in the ADM > >> + field of the KVM guest's CRYCB, so there is no need to assign control > >> + domains here unless you want to assign control domains that are not > >> + assigned as usage domains. > >> + > >> + If a mistake is made configuring an adapter, domain or control domain, > >> + you can use the unassign_xxx files to unassign the adapter, domain or > >> + control domain. > > > > Would it be more consistent with other sysfs entries to call these > > "bind" and "unbind" rather than "assign" and "unassign"? > > > > I'm not sure. Let me recap what is this about. First we partition our AP > resources (queues) into 'to be used by the host' and 'to be used by the > guests' using the /sys/bus/ap/a[pq]mask. Then we sub-partition 'to be > used by the guests' among the guests. Each vfio ap_matrix represents a > partition that corresponds to a single VM (guest). Such a partition can > be seen as a device that grants access (usage or administration) to a set > of queues. We can grant access to queues even if these are not currently > available in the system. E.g. we can partition an AP card that is > currently not online or even plugged. When The resources become available > (e.g. new card plugged, or just 'configured'), the resources (should) > become available at all levels if the authorization is right. I know > 'bind' binding a device (that currently exists within the system) to a > driver. The vfio_ap stuff seems different enough to seek a different name > for it. As discussed elsewhere, I'm not in favor of allowing mdev devices to exist which rely on resources that are not present or in-use elsewhere. If we remove that concept, then we are actually binding and unbinding resources to the mdev device. > >> + > >> + To display the matrix configuration for Guest1: > >> + > >> + cat matrix > >> + > >> + This is how the matrix is configured for Guest2: > >> + > >> + echo 5 > assign_adapter > >> + echo 0x47 > assign_domain > >> + echo 0xff > assign_domain > >> + > >> +5. The adminstrator now needs to activate the mediated devices. > >> + echo 1 > activate > > > > Or optionally not. As in reply to cover letter, I don't think this > > interface is sufficiently justified, or necessarily desirable. > > > > You are right. Activation is not necessary because the open will try to > auto-activate it. But honestly, I would recommend activating it in > advance, so the admin can be sure the resources are claimed. This can also be solved by claiming the resource at the point at which it is bound/assigned/attached to the mdev device (ie. at any point where we introduce new intersections in the matrix). > In our current design we choose to give the administrator the freedom, to > decide on his policy how does he want to manage the vfio_ap mdevs > lifecycle. And to decide what guarantees does he want to have at a given > point in time. With 'activate' both 'dynamically and on-demand of > starting a VM' and 'create known vfio_ap mdevs at system bring up' are > viable. If there were a consensus on how the life-cycle of the mdev > devices is to be managed, it would be easier to not give this freedom to > the admin. Please, this is not a freedom of the admin issue, this simply defines the point in time at which bound/assigned/attached resources are committed to the mdev device. In the model proposed here, the activate attribute allows resources to be over-committed, placing the burden on the management tools to implicitly understand which mdev devices can be used simultaneously. With the existing model, resources are committed on creation. I'll agree that such a model doesn't necessarily apply to these new devices, but I'd argue the next logical step is that resources are committed as they're attached such that the mdev device is always in a usable state. AFAICT, the same "freedom" is available to the admin by simply scripting mdev creation and assembly without a change to the mdev interface. > >> +6. Start Guest1: > >> + > >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... > >> + > >> +7. Start Guest2: > >> + > >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... > >> + > >> +When the guest is shut down, the mediated matrix device may be removed. > >> + > >> +Using our example again, to remove the mediated matrix device $uuid1: > >> + > >> + /sys/devices/virtual/misc > >> + --- [vfio_ap] > >> + --------- [mdev_supported_types] > >> + ------------ [vfio_ap-passthrough] > >> + --------------- [devices] > >> + ------------------ [$uuid1] > >> + --------------------- activate > >> + --------------------- remove > >> + > >> + > >> + echo 0 > activate > >> + echo 1 > remove > >> + > >> + This will release all the AP queues for the mediated device and > >> + remove all of the mdev matrix device's sysfs structures. To > >> + recreate and reconfigure the mdev matrix device, all of the steps starting > >> + with step 4 will have to be performed again. > >> + > >> + It is not necessary to remove an mdev matrix device, but one may want to > >> + remove it if no guest will use it during the lifetime of the linux host. If > >> + the mdev matrix device is removed, one may want to unbind the AP queues the > >> + guest was using from the vfio_ap device driver and bind them back to the > >> + default driver. Alternatively, the AP queues can be configured for another > >> + mdev matrix (i.e., guest). In either case, one must take care to change the > >> + secure key configured for the domain to which the queue is connected. > > > > This seems prime for data leakage, extremely sensitive data at that. > > Who's responsibility is it to prevent this? Shouldn't closing the > > device reset the device, which should perhaps wipe any key > > configuration? > > > > The documentation seems to be a bit outdated here. We do reset the device > in the release callback and on VFIO_DEVICE_RESET ioctl. There might be some > stuff that is not supposed to be reset by us, but that is fine (master keys > managed via special infrastructure called the TKE). I'm not sure what is the > meaning of last sentence of the paragraph in question ('change the secure > key configured ...') though. Anyway we will have to update this. > > >> + > >> + > >> +Limitations > >> +=========== > >> +An admin needs to be careful when writing to sysfs files > >> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver > >> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap > >> +driver and assigned to a mediated device, admin should not re-assign the > >> +AP queues back to the default driver, because of technical limitations. > >> +The kernel does not prevent you from re-assigning from AP queues reserved > >> +for the vfio_ap driver back to default driver. Future enhancements in > >> +the ap_bus and vfio_ap are likely to introduce complete support for such > >> +dynamic reconfiguration. But until then be extremely careful. > > > > Why don't we have these enhancements now? Should the whole thing be > > marked experimental without them? This sounds similar to a vfio-pci > > case where a group with multiple devices could have device which is > > unused by the user unbound from vfio-pci and re-bound to a native host > > driver. We BUG_ON when this occurs to protect the data. > > I think, marking vfio_ap as experimental is probably a good idea. I think > we can do these enhancements later. Of course, having everything right away > is the best, but the problem with that is that it takes (more) time. > > > > > Probably also need to evaluate updates to > > Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is > > supposed to interact with an 'activate' attribute. Thanks, > > > > I'm not sure documenting the 'activate' in Documentation/ABI/testing/sysfs-bus-vfio-mdev > is necessary. I see it more like the other type specific attributes. For > instance drivers/gpu/drm/i915/gvt/kvmgt.c seems to have vgpu_id and > https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm > suggests that Nvidia vgpus have a vgpu_params (read-write) attribute to > 'control the behavior of the vGPU, such as the frame rate limiter...'. I disagree, the difference is that the existing vendor attributes can modify the behavior of the mdev device, but are not part of the standard mdev sysfs ABI. You recommend above that 'activate' should be used prior to open(2) in order to commit mdev resources. That's a fundamental change in the management of the device, which is only obscured by the fact that open(2) does an implicit activate. Let's please just drop this whole notion of 'activate'. > AFAIK the activate and the assign/unassign attributes are because we can > not do much with homogeneous vfio_ap mdevs. The only safe default for > a vfio_ap mdev is all zero masks, that can't use any resources, that > is essentially useless. And this is how we construct the vfio_ap mdev > devices. In that sense we could as well construct them 'activated', > as empty won't result in any kind of conflicts (and allow the admin > to de-activate these so we are still flexible about the lifecycle > management). I agree that a zero mask is the only safe default for creation and that vendor specific attributes are necessary to build the configuration up to something that provides some usable value (unless we decide to expand the create interface). As above though, I disagree about what 'activate' does or does not provide. I don't see that it adds and flexibility in lifecycle management, at least not in a way that's compatible with existing mdev devices or in a way to facilitates upper level software management. It seems to only allow the admin to avoid dynamically creating and attaching resources to mdevs, at the cost of manage layers needing to account for over-commitment. Thanks, Alex