Re: [PATCH v7 22/22] s390: doc: detailed specifications for AP virtualization

Halil Pasic <pasic@xxxxxxxxxxxxx> · Thu, 2 Aug 2018 16:56:45 +0200

On 08/01/2018 08:08 PM, Alex Williamson wrote:
On Wed, 1 Aug 2018 13:12:05 +0200
Halil Pasic <pasic@xxxxxxxxxxxxx> wrote:

On 07/27/2018 07:44 PM, Alex Williamson wrote:
On Thu, 26 Jul 2018 21:54:29 +0200
Christian Borntraeger <borntraeger@xxxxxxxxxx> wrote:
...
+The process for reserving an AP queue for use by a KVM guest is:
+
+* The vfio-ap driver during its initialization will perform the following:
+  * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap
+  * Create the 'matrix' device in the 'vfio_ap' root
+  * Register the matrix device with the device core
+* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and
+  newer) and to provide the vfio_ap driver's probe and remove callback
+  interfaces. The reason why older devices are not supported is because there
+  are no systems available on which to test.
+* The admin needs to reserve the AP Queue for vfio_ap driver. This can be
+  done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain
+  mask to /sys/bus/ap/aqmask.
+
+  For example to reserve all the AP Queues on the system for vfio_ap driver:
+
+  echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask
+  echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask

And this is a reasonable user syntax? ;)

We intend to provide tooling that makes this more human friendly. BTW
echo 0 > sys/bus/ap/apmask would also work -- and looks less scary.

If tooling is required, perhaps the interface is wrong by design.
Maybe the interface should accept a list of bits, for example:

echo 255,0 > /sys/bus/ap/aqmask

vs

echo 0x8000000000000000000000000000000000000000000000000000000000000001 > /sys/bus/ap/aqmask

Sorry I'm not the right person to have this discussion with. Just wanted to
provide some background as someone who was there when this was discussed internally.
Harald should be back next week.

[..]

Would it be more consistent with other sysfs entries to call these
"bind" and "unbind" rather than "assign" and "unassign"?

I'm not sure. Let me recap what is this about. First we partition our AP
resources (queues) into 'to be used by the host' and 'to be used by the
guests' using the /sys/bus/ap/a[pq]mask. Then we sub-partition 'to be
used by the guests' among the guests. Each vfio ap_matrix represents a
partition that corresponds to a single VM (guest). Such a partition can
be seen as a device that grants access (usage or administration) to a set
of queues. We can grant access to queues even if these are not currently
available in the system. E.g. we can partition an AP card that is
currently not online or even plugged. When The resources become available
(e.g. new card plugged, or just 'configured'), the resources (should)
become available at all levels if the authorization is right. I know
'bind' binding a device (that currently exists within the system) to a
driver. The vfio_ap stuff seems different enough to seek a different name
for it.

As discussed elsewhere, I'm not in favor of allowing mdev devices to
exist which rely on resources that are not present or in-use
elsewhere.  If we remove that concept, then we are actually binding and
unbinding resources to the mdev device.

I understand and respect your point. I'm torn between the mdev perspective
and the perspective given by the architecture. On s390x KVM is a second and
higher level hyperviseor, and I prefer if the things are consistent among
the different levels of virtualization (if possible). What I described (granting
and gaining access to AP resources that are not a part of the configuration at
a certain moment) is how things work at LPAR level.

In use elsewhere is a taboo of course, and we check for that. Here we are
completely on the same page. But I fail to see why 'not (yet) present but
guaranteed to be 'ours' once it shows up (if)' has to be excluded on
KVM level if it is valid on LPAR level. Can you help me understand that
aspect?

+
+   To display the matrix configuration for Guest1:
+
+   cat matrix
+
+   This is how the matrix is configured for Guest2:
+
+   echo 5 > assign_adapter
+   echo 0x47 > assign_domain
+   echo 0xff > assign_domain
+
+5. The adminstrator now needs to activate the mediated devices.
+   echo 1 > activate

Or optionally not.  As in reply to cover letter, I don't think this
interface is sufficiently justified, or necessarily desirable.

You are right. Activation is not necessary because the open will try to
auto-activate it. But honestly, I would recommend activating it in
advance, so the admin can be sure the resources are claimed.

This can also be solved by claiming the resource at the point at which
it is bound/assigned/attached to the mdev device (ie. at any point
where we introduce new intersections in the matrix).

See below.

In our current design we choose to give the administrator the freedom, to
decide on his policy how does he want to manage the vfio_ap mdevs
lifecycle. And  to decide what guarantees does he want to have at a given
point in time. With 'activate' both 'dynamically and on-demand of
starting a VM' and 'create known vfio_ap mdevs at system bring up' are
viable. If there were a consensus on how the life-cycle of the mdev
devices is to be managed, it would be easier to not give this freedom to
the admin.

Please, this is not a freedom of the admin issue, this simply defines
the point in time at which bound/assigned/attached resources are
committed to the mdev device.  In the model proposed here, the activate
attribute allows resources to be over-committed, placing the burden on
the management tools to implicitly understand which mdev devices can be
used simultaneously. 

We have never seen this as the responsibility of the management software. The
idea was that this would be another thing were management software does not have
complete understanding. If the admin decides to do things that way activating
the vfio_ap mdev was just supposed to fail.

With the existing model, resources are committed
on creation.  I'll agree that such a model doesn't necessarily apply to
these new devices, but I'd argue the next logical step is that
resources are committed as they're attached such that the mdev device
is always in a usable state.

Having the device in a not usable state is indeed very ugly. When
we were discussing this internally I was arguing for an atomic bulk
update of the matrix that either succeeds or fails. Makes things easier
to think about. If we go by the acquire the diff on each assign/(bind)
operation we kind of have an transaction to get in the desired state.
Deadlocks may happen if no protocol is established, and similar.

Please also consider we don't add one resource at a time. If we have the
the adapters 0, 1 and 2 assigned, but no domains, and we assign domain
0 we 'get' not one but three queues.

The 'activate' is a quite new, and in my opinion somewhat messy concept
that was proposed during the workshop mentioned by Christian and introduced
in this round.

One of the things that is associated with 'activate' is kind of an 'edit'
and 'commit/save' workflow. Especially thinking about changing the authorizations
(the matrix) for a running VM (which is not supported yet), not being
able to do it in one go does not sound like a good idea to me. But that would
effectively mean that we would have to remain functional with the 'old' set
of resources (initially empty). The we could maintain device is operational
with the claimed  resources an invariant.

AFAICT, the same "freedom" is available
to the admin by simply scripting mdev creation and assembly without a
change to the mdev interface.

You mean the qemu hooks in libvirt I guess. I don't quite understand. AFAIU
if we use the hooks we commit to populate (and create) the vfio_ap mdevs
dynamically on demand.

AFAIU we can not have all three of the following at the same time
1) mutually overlapping definitions of vfio_ap devcies
2) construct and populate mdevs according to each of these definitions
   e.g at system bring up time.
3) Maintain all 'resources claimed' as an invariant for a vfio_ap mdev.

Your proposal resolves this by compromising on: not creating and populating
a vfio_ap device for each definition, but only for the ones that are actually
needed. I used to argue along the same lines, but I'm not sure how acceptable
is this hooks based solution to our management software people.

[..]

I'm not sure documenting the 'activate' in Documentation/ABI/testing/sysfs-bus-vfio-mdev
is necessary. I see it more like the other type specific attributes. For
instance drivers/gpu/drm/i915/gvt/kvmgt.c seems to have vgpu_id and
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm
suggests that Nvidia vgpus have a vgpu_params (read-write) attribute to
'control the behavior of the vGPU, such as the frame rate limiter...'.

I disagree, the difference is that the existing vendor attributes can
modify the behavior of the mdev device, but are not part of the
standard mdev sysfs ABI.  You recommend above that 'activate' should be
used prior to open(2) in order to commit mdev resources.  That's a
fundamental change in the management of the device, which is only
obscured by the fact that open(2) does an implicit activate.

I agree, you have a point. Requiring to do an activate before doing an
open is a heavy interference with the mdev ABI. And making the open
fail is ugly indeed.

Let's please just drop this whole notion of 'activate'.

AFAIK the activate and the assign/unassign attributes are because we can
not do much with homogeneous vfio_ap mdevs. The only safe default for
a vfio_ap mdev is all zero masks, that can't use any resources, that
is essentially useless. And this is how we construct the vfio_ap mdev
devices. In that sense we could as well construct them 'activated',
as empty won't result in any kind of conflicts (and allow the admin
to de-activate these so we are still flexible about the lifecycle
management).

I agree that a zero mask is the only safe default for creation and that
vendor specific attributes are necessary to build the configuration up
to something that provides some usable value (unless we decide to
expand the create interface).  As above though, I disagree about what
'activate' does or does not provide.  I don't see that it adds and
flexibility in lifecycle management, at least not in a way that's
compatible with existing mdev devices or in a way to facilitates upper
level software management.  It seems to only allow the admin to avoid
dynamically creating and attaching resources to mdevs, at the cost of
manage layers needing to account for over-commitment.  Thanks,

I tried to clarify my views on some of these points (responsibility of
management layers regarding over-commitment, and the flexibility in lifecycle
management). The truth is, I used to argue against 'activate', and I would
be very happy to forget about it if we can come up with something better.

It's a bit unfortunate that I will be on a vacation (I'm already on
vacation) for some time. I will try to stay tuned though. Thanks for the
nice and productive discussion. We (and especially I) learned a lot about your
take on the problem.

Cheers,
Halil

--
To unsubscribe from this list: send the line "unsubscribe linux-s390" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html