RE: Plan for /dev/ioasid RFC v2

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Fri, 25 Jun 2021 10:27:18 +0000

Hi, Alex/Joerg/Jason,

I'd like to draw your attention to the updated proposal below. Let's see
whether we can converge on a direction to move forward. 😊

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, June 19, 2021 2:23 AM
> 
> On Fri, Jun 18, 2021 at 04:57:40PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Friday, June 18, 2021 8:20 AM
> > >
> > > On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> > >
> > > > I've referred to this as a limitation of type1, that we can't put
> > > > devices within the same group into different address spaces, such as
> > > > behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> > > > As isolation support improves we see fewer multi-device groups, this
> > > > > scenario becomes the exception.  Buy better hardware to use the
> > > > > devices independently.
> > >
> > > This is basically my thinking too, but my conclusion is that we should
> > > not continue to make groups central to the API.
> > >
> > > As I've explained to David this is actually causing functional
> > > problems and mess - and I don't see a clean way to keep groups central
> > > but still have the device in control of what is happening. We need
> > > this device <-> iommu connection to be direct to robustly model all
> > > the things that are in the RFC.
> > >
> > > To keep groups central someone needs to sketch out how to solve
> > > today's mdev SW page table and mdev PASID issues in a clean
> > > way. Device centric is my suggestion on how to make it clean, but I
> > > haven't heard an alternative??
> > >
> > > So, I view the purpose of this discussion to scope out what a
> > > device-centric world looks like and then if we can securely fit in the
> > > legacy non-isolated world on top of that clean future oriented
> > > API. Then decide if it is work worth doing or not.
> > >
> > > To my mind it looks like it is not so bad, granted not every detail is
> > > clear, and no code has be sketched, but I don't see a big scary
> > > blocker emerging. An extra ioctl or two, some special logic that
> > > activates for >1 device groups that looks a lot like VFIO's current
> > > logic..
> > >
> > > At some level I would be perfectly fine if we made the group FD part
> > > of the API for >1 device groups - except that complexifies every user
> > > space implementation to deal with that. It doesn't feel like a good
> > > trade off.
> > >
> >
> > Would it be an acceptable tradeoff by leaving >1 device groups
> > supported only via legacy VFIO (which is anyway kept for backward
> > compatibility), if we think such scenario is being deprecated over
> > time (thus little value to add new features on it)? Then all new
> > sub-systems including vdpa and new vfio only support singleton
> > device group via /dev/iommu...
> 
> That might just be a great idea - userspace has to support those APIs
> anyhow, if it can be made trivially obvious to use this fallback even
> though /dev/iommu is available it is a great place to start. It also
> means PASID/etc are naturally blocked off.
> 
> Maybe years down the road we will want to harmonize them, so I would
> still sketch it out enough to be confident it could be implemented..
> 

First let's align on the high-level goal of supporting multi-device groups
via IOMMU fd. Based on previous discussions I feel it's fair to say that
we will not provide new features beyond what the vfio group delivers today,
which implies:

1) All devices within the group must share the same address space.

        Though it's possible to support multiple address spaces (e.g. if caused
        by !ACS), there are some scenarios (DMA aliasing, RID sharing, etc.)
        where a single address space is mandatory. The effort to support
        multiple spaces is not worthwhile, since isolation is improving over
        time and multi-device groups are becoming the exception.

2) It's not necessary to bind all devices within the group to the IOMMU fd.

        Other devices could be left unused, or bound to a known driver which
        doesn't do DMA. This implies a group viability mechanism must be in
        place which can identify when the group is viable for operation and
        BUG_ON() when a user action breaks that viability.

3) The user must be denied access to a device until its group is attached
     to a known security context.

If the above goals are agreed, below is the updated proposal for supporting
multi-device groups via a device-centric API. Most ideas come from Jason;
here I try to expand on them and compose them into a full picture.

In general:

-   vfio keeps the existing uAPI sequence, with slightly different semantics:

        a) VFIO_GROUP_SET_CONTAINER, as today

        b) VFIO_SET_IOMMU with a new iommu type (VFIO_EXTERNAL_IOMMU)
             which, once set, tells VFIO not to establish its own
             security context.

        c)  VFIO_GROUP_GET_DEVICE_FD_NEW, carrying additional info
             about the external iommu driver (iommu_fd, device_cookie). This
             call automatically binds the device to iommu_fd. The device fd is
             returned to the user only after successful binding, which implies
             that a security context (BLOCK_DMA) has been established for the
             entire group. Since the security context is managed by iommu_fd,
             the group viability check should be done in the iommu layer, so
             the vfio_group_viable() mechanism is redundant in this case.
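The ordering in a) to c) can be sketched as a small user-space model in C. This is only an illustration under my own assumptions, not a uAPI definition: struct seq_group, struct seq_bind_info, the seq_*() helpers and the placeholder fd value are all hypothetical, and the real argument layout for VFIO_GROUP_GET_DEVICE_FD_NEW is not decided here. The point it shows is that the device fd is handed out only after binding has put the group into BLOCK_DMA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-group state vfio would track. */
struct seq_group {
    bool has_container;   /* a) VFIO_GROUP_SET_CONTAINER done */
    bool external_iommu;  /* b) VFIO_SET_IOMMU(VFIO_EXTERNAL_IOMMU) done */
    bool block_dma;       /* security context established by iommu_fd */
};

/* Hypothetical extra info carried by the new get-device-fd call. */
struct seq_bind_info {
    int iommu_fd;
    uint64_t device_cookie;
};

/* a) attach the group to a container, as today */
static void seq_set_container(struct seq_group *g)
{
    g->has_container = true;
}

/* b) select the external iommu type; needs a container first */
static int seq_set_external_iommu(struct seq_group *g)
{
    if (!g->has_container)
        return -1;
    g->external_iommu = true;
    return 0;
}

/*
 * c) bind the device to iommu_fd, then return a device fd.
 * Returns a fake fd (>= 0) on success, -1 if the sequence was not followed.
 */
static int seq_get_device_fd_new(struct seq_group *g,
                                 const struct seq_bind_info *info)
{
    assert(g != NULL && info != NULL);
    if (!g->has_container || !g->external_iommu || info->iommu_fd < 0)
        return -1;
    g->block_dma = true;  /* binding succeeds before any fd is returned */
    return 100;           /* placeholder value standing in for a real fd */
}
```

Calling seq_get_device_fd_new() before the two earlier steps fails in this model, which mirrors goal 3) above.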

-   When receiving the binding call for the 1st device in a group, iommu_fd 
    calls iommu_group_set_block_dma(group, dev->driver) which does 
    several things:

        a) Check group viability. A group is viable only when all devices in
            the group are in one of the states below:

                * driver-less
                * bound to a driver which is same as dev->driver (vfio in this case)
                * bound to an otherwise allowed driver (same list as in vfio)

        b) Set the block_dma flag for the group and configure the IOMMU to
            block DMA for all devices in this group. This could be done by
            attaching to a dedicated iommu domain (IOMMU_DOMAIN_BLOCKED)
            which has an empty page table.

        c) The iommu layer also verifies group viability on the
            BUS_NOTIFY_BOUND_DRIVER event. BUG_ON if viability is broken
            while block_dma is set.
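To make the viability rule and the notifier check above concrete, here is a rough user-space model in C. It is a sketch, not kernel code: struct model_dev, struct model_group, the model_*() helpers and the single entry in allowed_drivers[] are made up for illustration, and drivers are compared by name rather than by driver pointer as dev->driver would be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model types; none of these are kernel structures. */
enum domain_kind { DOMAIN_DEFAULT, DOMAIN_BLOCKED };

struct model_dev {
    const char *driver;          /* NULL means driver-less */
};

struct model_group {
    struct model_dev *devs;
    size_t ndevs;
    bool block_dma;
    enum domain_kind domain;
};

/* Placeholder for the "otherwise allowed driver" list (same idea as vfio). */
static const char *const allowed_drivers[] = { "pci-stub" };

static bool driver_allowed(const char *drv, const char *owner)
{
    size_t n = sizeof(allowed_drivers) / sizeof(allowed_drivers[0]);

    if (drv == NULL)
        return true;             /* driver-less */
    if (owner != NULL && strcmp(drv, owner) == 0)
        return true;             /* same driver as the binding device */
    for (size_t i = 0; i < n; i++)
        if (strcmp(drv, allowed_drivers[i]) == 0)
            return true;         /* otherwise allowed driver */
    return false;
}

/* a) a group is viable only if every device is in an allowed state */
static bool group_viable(const struct model_group *g, const char *owner)
{
    assert(g != NULL);
    for (size_t i = 0; i < g->ndevs; i++)
        if (!driver_allowed(g->devs[i].driver, owner))
            return false;
    return true;
}

/* a) + b): models iommu_group_set_block_dma(); 0 on success, -1 otherwise */
static int model_set_block_dma(struct model_group *g, const char *owner)
{
    if (!group_viable(g, owner))
        return -1;
    g->block_dma = true;
    g->domain = DOMAIN_BLOCKED;  /* empty page table: all DMA blocked */
    return 0;
}

/* c) bound-driver notifier: true means the kernel would hit BUG_ON() */
static bool model_driver_bound_would_bug(const struct model_group *g,
                                         const char *owner)
{
    return g->block_dma && !group_viable(g, owner);
}
```

In this model a group with one device on "vfio-pci" and one driver-less device is viable, and later binding the second device to an unrelated driver while block_dma is set is the case that would trigger BUG_ON.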

-   Binding other devices in the group to iommu_fd just succeeds since 
    the group is already in block_dma.

-   When a group is in block_dma state, all devices in the group (even those
    not bound to iommu_fd) switch together between the blocked domain and
    the IOASID domain, initiated by attaching to or detaching from an IOASID.

        a) iommu_fd verifies that all bound devices in the same group are
            attached to a single IOASID.

        b) the 1st device attach in the group calls iommu API to move the 
             entire group to use the new IOASID domain.

        c) the last device detach calls iommu API to move the entire group 
            back to the blocked domain. 
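The attach/detach rules in a) to c) amount to a per-group counter. A minimal sketch in C, assuming a simplified model where an IOASID is just a non-negative integer and per-device attach state is not tracked; struct grp_state and the grp_*() helpers are hypothetical names, not proposed kernel interfaces:

```c
#include <assert.h>
#include <stddef.h>

#define NO_IOASID (-1)

enum grp_domain { GRP_BLOCKED, GRP_IOASID };

/* Hypothetical per-group tracking kept inside iommu_fd. */
struct grp_state {
    enum grp_domain domain;
    int ioasid;      /* IOASID the whole group uses, or NO_IOASID */
    int nattached;   /* bound devices currently attached */
};

static void grp_init(struct grp_state *g)
{
    g->domain = GRP_BLOCKED;   /* group starts in block_dma state */
    g->ioasid = NO_IOASID;
    g->nattached = 0;
}

/* Returns 0 on success, -1 if the group already uses a different IOASID. */
static int grp_attach(struct grp_state *g, int ioasid)
{
    assert(g != NULL && ioasid >= 0);
    if (g->nattached > 0 && g->ioasid != ioasid)
        return -1;             /* a) single IOASID per group */
    if (g->nattached == 0) {
        g->domain = GRP_IOASID; /* b) 1st attach moves the entire group */
        g->ioasid = ioasid;
    }
    g->nattached++;
    return 0;
}

/* Returns 0 on success, -1 if nothing is attached. */
static int grp_detach(struct grp_state *g)
{
    assert(g != NULL);
    if (g->nattached == 0)
        return -1;
    if (--g->nattached == 0) {
        g->domain = GRP_BLOCKED; /* c) last detach goes back to blocked */
        g->ioasid = NO_IOASID;
    }
    return 0;
}
```

The domain switch happens only on the 0 -> 1 and 1 -> 0 transitions of the counter; attaches in between only check that the IOASID matches.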

-   A device is allowed to be unbound from iommu_fd when other devices
    in the group are still bound. In this case the group is still in block_dma
    state thus the unbound device should not be bound to another driver
    which could break the group viability.

         a) for vfio this unbind is done automatically when the device fd is closed.

-   When vfio requests to unbind the last device in the group, iommu_fd
    calls iommu_group_unset_block_dma(group) to move the group out
    of the block_dma state. Devices in the group are re-attached to the 
    default domain from now on.
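The bind/unbind lifetime of the block_dma state can be modelled the same way. Again this is only a sketch under my own assumptions: struct life_group and the life_*() helpers are hypothetical, and the viability check done on the first bind is left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum life_domain { LIFE_DEFAULT, LIFE_BLOCKED };

/* Hypothetical per-group binding state kept inside iommu_fd. */
struct life_group {
    int nbound;               /* devices currently bound to iommu_fd */
    bool block_dma;
    enum life_domain domain;
};

/* Bind one device; the 1st bind puts the whole group into block_dma. */
static void life_bind(struct life_group *g)
{
    assert(g != NULL);
    if (g->nbound++ == 0) {
        g->block_dma = true;          /* iommu_group_set_block_dma() */
        g->domain = LIFE_BLOCKED;
    }
    /* later binds just succeed: the group is already in block_dma */
}

/* Unbind one device; returns -1 if nothing is bound. */
static int life_unbind(struct life_group *g)
{
    assert(g != NULL);
    if (g->nbound == 0)
        return -1;
    if (--g->nbound == 0) {
        g->block_dma = false;         /* iommu_group_unset_block_dma() */
        g->domain = LIFE_DEFAULT;     /* re-attach to the default domain */
    }
    return 0;
}
```

While nbound stays above zero the group remains blocked, which is why an unbound device in such a group should not be handed to another driver.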

With this design all the helper functions and uAPI are kept device-centric
in iommu_fd. It maintains minimal group knowledge internally by tracking
device binding/attaching status within each group and then calling the
proper iommu API when the group status changes.

VFIO still keeps its container/group/device semantics for backward
compatibility.

A new subsystem can completely eliminate group semantics as long as
it can find a way to finish device binding before granting the user
access to the device.

Thanks
Kevin