Re: [PATCH v7 22/22] s390: doc: detailed specifications for AP virtualization

Tony Krowiak <akrowiak@xxxxxxxxxxxxx> · Thu, 2 Aug 2018 17:59:42 -0400

On 08/02/2018 11:25 AM, Alex Williamson wrote:
On Wed, 1 Aug 2018 19:05:35 -0400
Tony Krowiak <akrowiak@xxxxxxxxxxxxx> wrote:

On 07/27/2018 01:44 PM, Alex Williamson wrote:
On Thu, 26 Jul 2018 21:54:29 +0200
Christian Borntraeger <borntraeger@xxxxxxxxxx> wrote:
...
+The process for reserving an AP queue for use by a KVM guest is:
+
+* The vfio-ap driver during its initialization will perform the following:
+  * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap
+  * Create the 'matrix' device in the 'vfio_ap' root
+  * Register the matrix device with the device core
+* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and
+  newer) and to provide the vfio_ap driver's probe and remove callback
+  interfaces. The reason why older devices are not supported is because there
+  are no systems available on which to test.
+* The admin needs to reserve the AP Queue for vfio_ap driver. This can be
+  done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain
+  mask to /sys/bus/ap/aqmask.
+
+  For example to reserve all the AP Queues on the system for vfio_ap driver:
+
+  echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask
+  echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask
And this is a reasonable user syntax? ;)
I am not a fan of writing a 256-bit bitmask as such. It is entirely too
difficult to
figure out which character to specify to indicate a bit number to be set
for anything
past the first few characters. Of course, tooling could be used to
accomplish this
task, but the sysfs interfaces should probably be user-friendly. A
comma-separated
list could be used; for example to set reserve adapters 21, 99 and 249:

echo 21,99,249 > /sys/bus/ap/apmask

The problem with this is how do we indicate that all bits are to be
cleared? Do you
have any suggestions for a reasonable user syntax?
Perhaps an empty write, just a carriage return, "null".  I'd shy away
from magic digits, like 256 to clear.

The crypto team has decided to eliminate these sysfs attributes for the 
AP bus
and rely on bind/unbind.

...
+  * mdev_attr_groups
+    This attribute group identifies the user-defined sysfs attributes of the
+    mediated device. When a device is registered with the VFIO mediated device
+    framework, the sysfs attributes files identified in the 'mdev_attr_groups'
+    structure will be created in the mediated matrix device's directory. The
+    sysfs attributes for a mediated matrix device are:
+    * assign_adapter:
+      A write-only file for assigning an AP adapter to the mediated matrix
+      device. To assign an adapter, the APID of the adapter is written to the
+      file.
+    * assign_domain:
+      A write-only file for assigning an AP usage domain to the mediated matrix
+      device. To assign a domain, the APQI of the AP queue corresponding to a
+      usage domain is written to the file.
+    * matrix:
+      A read-only file for displaying the APQNs derived from the adapters and
+      domains assigned to the mediated matrix device.
+    * assign_control_domain:
+      A write-only file for assigning an AP control domain to the mediated
+      matrix device. To assign a control domain, the ID of a domain to be
+      controlled is written to the file. For the initial implementation, the set
+      of control domains will always include the set of usage domains, so it is
+      only necessary to assign control domains that are not also assigned as
+      usage domains.
How will the user know when this changes?  What's the transition plan
to maintain compatibility when it does change?
I'm sorry, I don't understand the question. When you say user, about
whom are
you talking; the person doing the assignments? What changes are you talking
about?
The user would be the person interaction with the sysfs attribute
assign_control_domain where it states "[f]or the initial implementation,
the set of control domains will always include the set of usage
domains, so it is only necessary to assign control domains that are not
also assigned as usage domains."  I infer from the usage of "initial
implementation" that this behavior may not always be the case and
therefore if it's not final what is the transition plan so that a user
interacting with this attribute can know the current behavior.

Ah, I get your point now. I see how this verbiage would lead you to that
conclusion and it will be removed. This behavior will not change.

...
+4. The administrator now needs to configure the matrixes for mediated
+   devices $uuid1 (for Guest1) and $uuid2 (for Guest2).
+
+   This is how the matrix is configured for Guest1:
+
+   echo 5 > assign_adapter
+   echo 6 > assign_adapter
+   echo 4 > assign_domain
+   echo 0xab > assign_domain
+
+   For this implementation, all usage domains - i.e., domains assigned
+   via the assign_domain attribute file - will also be configured in the ADM
+   field of the KVM guest's CRYCB, so there is no need to assign control
+   domains here unless you want to assign control domains that are not
+   assigned as usage domains.
+
+   If a mistake is made configuring an adapter, domain or control domain,
+   you can use the unassign_xxx files to unassign the adapter, domain or
+   control domain.
Would it be more consistent with other sysfs entries to call these
"bind" and "unbind" rather than "assign" and "unassign
Aren't the bind and unbind sysfs interfaces typically associated with
binding devices to and unbinding devices from a device driver?
Yes

In this
case we are talking about configuring a mediated device's AP matrix - i.e.,
setting bits identifying the adapters, domains and control domains defined
for the mediated device (i.e., a guest). Maybe there are better words we
could use than assign/unassign, but I find the use of bind/unbind to be
confusing. I'm open to suggestions.
If we remove the 'activate' interface and dedicate resources to the
mdev device at the point where we create an intersection in the matrix,
wouldn't that resource be "bound" to the mdev at that time?  This is
not a strong issue for me, I just thought "assign" seemed inconsistent
as the operation felt more like a traditional "bind".

We are going to remove the activate for the next series, but stick with
the assign terminology because we're not really binding devices, but 
constructing a
matrix that identifies the AP queues to which the guest using the mdev
will have access.

+
+   To display the matrix configuration for Guest1:
+
+   cat matrix
+
+   This is how the matrix is configured for Guest2:
+
+   echo 5 > assign_adapter
+   echo 0x47 > assign_domain
+   echo 0xff > assign_domain
+
+5. The adminstrator now needs to activate the mediated devices.
+   echo 1 > activate
Or optionally not.  As in reply to cover letter, I don't think this
interface is sufficiently justified, or necessarily desirable.
To clarify your objection, let me state what I think you are saying.

1. You do not think there is a good reason to assign duplicate APQNs
     - i.e., the cross product of all adapter IDs and domain IDs assigned
     to the mdev.
I think the cross product of adapters and domains is a basic part of
the design here, what I object to is that mdevs can exist with
overlapping cross products and which one can be activated is dependent
on activation ordering.  For example, if I have the following mdevs:

{adapters}{domains}
A: {0,1}{0,1}
B: {0}{0}
C: {0}{1}
D: {1}{0}
E: {1}{1}

Who is supposed to understand and debug that A cannot be activated
while any of B/C/D/E are activated?  What does the debugging process
look like?  If the mdev is not explicitly activated, only opened, how
do we determine the fault is from a resource conflict on the mdev
definition?  On the other hand if the cross product is committed at the
time it's specified, then a scenario where E already exists and we're
trying to create A might look like:

1. Specify adapter mask of {0,1}
2. Specify domain mask of {0,1} <-- write(2) fails with -EBUSY

Here we know that one or more of the domains we've specified are not
available on the set of adapters we've configured.

This is moot for the next series because the activate will be removed and
consistency checking will be done as adapters/domains are assigned.

2. All mdevs should be validly configured for use regardless of when
     guests using them are started. In other words, libvirt should be
     able to depend on the fact that an mdev is ready for use at all
     times.
Yes, an mdev should represent committed resources.  We support dynamic
creation and removal if the resources need to be freed for the host or
another mdev configuration.

See my previous response.

I will discuss this with the team when we meet on Thursday and get
back to you. If I am mistaken in my analysis of your concern, please
clarify.

I would like to mention that when a guest using an mdev starts,
the mdev will be activated if not already activated. The guest
will not start if another activated mdev is using an APQN assigned
to the mdev of the guest being started.
Understood.

+6. Start Guest1:
+
+   /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \
+      -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ...
+
+7. Start Guest2:
+
+   /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \
+      -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ...
+
+When the guest is shut down, the mediated matrix device may be removed.
+
+Using our example again, to remove the mediated matrix device $uuid1:
+
+   /sys/devices/virtual/misc
+      --- [vfio_ap]
+      --------- [mdev_supported_types]
+      ------------ [vfio_ap-passthrough]
+      --------------- [devices]
+      ------------------ [$uuid1]
+      --------------------- activate
+      --------------------- remove
+
+
+   echo 0 > activate
+   echo 1 > remove
+
+   This will release all the AP queues for the mediated device and
+   remove all of the mdev matrix device's sysfs structures. To
+   recreate and reconfigure the mdev matrix device, all of the steps starting
+   with step 4 will have to be performed again.
+
+   It is not necessary to remove an mdev matrix device, but one may want to
+   remove it if no guest will use it during the lifetime of the linux host. If
+   the mdev matrix device is removed, one may want to unbind the AP queues the
+   guest was using from the vfio_ap device driver and bind them back to the
+   default driver. Alternatively, the AP queues can be configured for another
+   mdev matrix (i.e., guest). In either case, one must take care to change the
+   secure key configured for the domain to which the queue is connected.
This seems prime for data leakage, extremely sensitive data at that.
Who's responsibility is it to prevent this?  Shouldn't closing the
device reset the device, which should perhaps wipe any key
configuration?
The device is zeroized (reset) when it is removed. The zeroize
instruction - to
the best of my knowledge - does nothing with the secure key for a given
queue.
There is also no instruction that I know of to clear keys, so it would
appear
that this is left to the system administrator.
:-o  That seems like a pretty serious fault.  Given that this interface
is specifically for crypto devices where such sensitive information
seems common, doesn't it seem like the kernel, ie. the mdev vendor
driver, should provide guarantees that should your user instance crash
or be killed that keys are not available to the system admin?

I may have misspoke with regard to my understanding of the AP reset. I've
been told that the keys stored with the domain will be reset. I will
verify this and get back to you.

+
+
+Limitations
+===========
+An admin needs to be careful when writing to sysfs files
+/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver
+to which an AP queue is bound to. Once an AP queue is bound vfio_ap
+driver and assigned to a mediated device, admin should not re-assign the
+AP queues back to the default driver, because of technical limitations.
+The kernel does not prevent you from re-assigning from AP queues reserved
+for the vfio_ap driver back to default driver.  Future enhancements in
+the ap_bus and vfio_ap are likely to introduce complete support for such
+dynamic reconfiguration. But until then be extremely careful.
Why don't we have these enhancements now?  Should the whole thing be
marked experimental without them?  This sounds similar to a vfio-pci
case where a group with multiple devices could have device which is
unused by the user unbound from vfio-pci and re-bound to a native host
driver.  We BUG_ON when this occurs to protect the data.
We have customers that have been waiting a long time and are very anxious
to have guest access to AP devices. In the interest of reducing time to
market, we have decided to simplify the initial implementation as much
as possible. I wouldn't necessarily call this experimental, but view it
more as the minimal viable product.

As far as BUG_ON, I will discuss this with the crypto team at our
meeting on Thursday and get back to you.
A minimum viable product still needs to protect the host and does not
excuse us from later breaking the user exposed ABI.

I will have a more thorough look into it.

Probably also need to evaluate updates to
Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is
supposed to interact with an 'activate' attribute.  Thanks,
As it stands, libvirt will not create or modify the attributes of an mdev.
If a guest is started under libvirt and its mdev has not been activated,
it will be activated when the mdev fd is opened. Of course, if another
activated mdev is assigned an APQN assigned to the mdev of the guest
being started, the guest will be terminated.
Exactly why I disapprove of the activated interface, but it's being
proposed and recommended for use as a standard part of the lifecycle
management for these devices.  Thanks,

As I stated elsewhere, we are removing the activate interface. All
consistency checking will be done as adapters/domains are assigned to
the mdev.

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-s390" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html