Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> · Sat, 03 Aug 2013 07:28:54 +1000

On Fri, 2013-08-02 at 10:36 -0600, Alex Williamson wrote:

> > Also in some cases, at least for us, we have a problem where the scope
> > of the reset can be larger than the group ... in this case I think we
> > need to forbid the reset.
> 
> Interesting, I would have ventured to guess that resets are always
> contained within a single group given the PE isolation domains rooted at
> a PHB. 

Depends how I configure them. For example, if I assign two functions to
two different PEs (SR-IOV typically), a hot reset will kill both.

For fundamental reset, depending on how the mobo is wired up and other
platform things, I might for example be limited on doing a PERST at the
PHB level, thus accross multiple PEs.

>  Why disallow it though?  What I propose below would still allow
> it if the user can prove that they own all of the affected groups.

Right, I mean disallow it if the PEs are not all owned by the same user.
What you propose would probably work.

>   Even
> on x86, I think it's too restrictive to only allow reset if the affect
> is contained within a group.  It's just too easy to need it for a
> multi-function device.  We could hope that a device that implements ACS
> also implements some sort of FLR, but that may just be wishful thinking.

Right. Another use case is, I know of devices that need a fundamental
reset (PERST) after applying a FW update.

> > For us, it's also tied into our "EEH" error handling. IE. The way the
> > guest will request some of these things is going to be via firmware APIs
> > that we have yet to implement, but that we were also thinking of
> > implementing entire in the kernel rather than qemu for various
> > reasons... IE. There is more to error handling & recovery in our case at
> > least than AER and reset :-)
> 
> I've been focused on reset for device re-initialization and
> repeatability, but that's a good point, we need to consider error
> recovery as a use of this interface as well.

Unfortunately, with EEH we have very specific FW interface (including
low level diagnostic data "blobs" etc...) that we need to implement so I
am not sure we can "fit" it in a generic interface.

We might want to attempt to list our interfaces and see. CC'ing Gavin
who is our EEH expert.

A rough cut is

 - We have the concept of "frozen" PE (frozen MMIO and frozen DMA, two
separate attributes). So we can query and change the freeze state on a
PE.

 - We have various level of error severity (from informational
(corrected) to full PHB need a reset with some variations in the middle
of PEs being frozen etc...)

 - We have diag blobs coming out of FW for the guest to log (and
possibly partially decode).

 - We have interfaces for various types of resets

> > I need to spend more time reading your proposal and see how it can work
> > for us (or not...)... but we might end up just doing our own thing that
> > carries the whole EEH API instead.
> 
> Obviously it would be great if it could work for you, but regardless of
> whether you intend to use it I'd appreciate feedback.  Thanks,

Ben.

> Alex
> 
> > > ---
> > > Mechanism to do PCI hot resets through VFIO:
> > > 
> > > VFIO is fundamentally an IOMMU group and device level interface.
> > > There's no concept of buses, slots, or hierarchies of devices.  There
> > > are only IOMMU group and devices.  A bus (or slot) may contain exactly
> > > one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
> > > An IOMMU group may contain one or more devices.
> > > 
> > > The first question is perhaps where should we create the interface to do
> > > a PCI hot reset.  Assuming an ioctl interface, our choices are the
> > > group, the container, or the device file descriptors.  Groups and
> > > containers are not PCI specific, so an extension on either of those
> > > doesn't make much sense.  They also don't have much granularity if your
> > > goal is to do a hot reset on the smallest subset of devices you can.
> > > Therefore the only choice seems to be a VFIO device level interface.
> > > 
> > > The fact that a hot reset affects multiple devices also raises concerns.
> > > How do we make sure a user has sufficient access/privilege to perform
> > > this operation?  If all of the affected devices are within the same
> > > group, then we know the user already "owns" all those devices.  Using
> > > groups as the boundary excludes a number of use cases though.  The user
> > > would need to prove that they also own the other groups or devices that
> > > are affected by the reset.  This might be multiple groups, so the ioctl
> > > quickly grows to requiring a list of file descriptors be passed for
> > > validation.
> > > 
> > > We already use the group file descriptor as a unit of ownership for
> > > enabling the container, so it seems like it would make sense to use it
> > > here too.  The alternative is a device file descriptor, but groups may
> > > encompass devices the user doesn't care to use and we don't want to
> > > require that they open a file descriptor simply to perform a hot reset.
> > > Groups can also contain devices that the user cannot open, for instance
> > > those owned by VFIO "compatible" drivers like pci-stub or pcieport.
> > > 
> > > The user also needs to know the set of devices affected by a hot reset,
> > > otherwise they have no idea which group file descriptors to pass to such
> > > an interface.  That implies we also need a separate "info" ioctl for the
> > > user to learn that information.  We could argue that the user could
> > > learn this information from sysfs, but that imposes non-trivial library
> > > or code overhead on the user to evaluate the topology.  The PCI hot
> > > reset info ioctl would need to indicate whether a hot reset is
> > > available, and the set of affected devices.  It may be useful to provide
> > > this as a {group, device} pair so the user doesn't need to
> > > cross-reference each device with sysfs to determine the group for the
> > > device.  This would then provide both the set of groups required to
> > > perform the hot reset and the set of devices affected by the hot reset.
> > > 
> > > As an alternative, we could consider simply requiring that all of the
> > > devices affected by a hot reset belong to the same VFIO container.
> > > However, allowing multiple groups per container is an optional IOMMU
> > > capability that really has no relation to PCI bus/slot boundaries.  It
> > > seems a bit arbitrary to require groups be placed in the same container
> > > to get a PCI hot reset.  That likely means we'd still need to support
> > > passing some kind of ownership token as above with groups.  So it
> > > doesn't seem to make the situation any better.
> > > 
> > > Given the above discussion, I therefore propose the following PCI hot
> > > reset interface:
> > > 
> > > /**
> > >  * VFIO_DEVICE_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + ??,
> > >  *                                        struct vfio_device_pci_hot_reset_info)
> > >  */
> > > 
> > > #define VFIO_DEVICE_PCI_HOT_RESET_INFO     _IO(VFIO_TYPE, VFIO_BASE + ??)
> > > 
> > > struct vfio_device_pci_hot_reset_info_entry {
> > > 	__u32	group_id;
> > > 	__u16	segment; /* A reset will never include devices on other segments... return it anyway */
> > > 	__u8	bus;
> > > 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > };
> > > 
> > > struct vfio_device_pci_hot_reset_info {
> > > 	__u32	argsz;
> > > 	__u32	flags;
> > > #define VFIO_PCI_HOT_RESET_SUPPORTED	(1 << 0) /* Device supports hot reset */
> > > #define VFIO_PCI_HOT_RESET_POPULATED	(1 << 1) /* Entries field are populated */
> > > 	__u32	count;
> > > 	struct vfio_device_pci_hot_reset_info_entry	entries[];
> > > };
> > > 
> > > The user calls VFIO_DEVICE_PCI_HOT_RESET_INFO on a VFIO device file
> > > descriptor with a struct vfio_device_pci_hot_reset_info data structure,
> > > minimally the sizeof the struct with argsz set.  VFIO returns whether
> > > hot reset is supported and the number of devices affected by the reset.
> > > If argsz is big enough, VFIO will fill in the entries and set the
> > > populated flag, otherwise the caller can reallocate the structure and
> > > try again.
> > > 
> > > /**
> > >  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + ??,
> > >  *                                  struct vfio_device_pci_hot_reset)
> > >  */
> > > 
> > > #define VFIO_DEVICE_PCI_HOT_RESET     _IO(VFIO_TYPE, VFIO_BASE + ??)
> > > 
> > > struct vfio_device_pci_hot_reset {
> > > 	__u32	argsz;
> > > 	__u32	flags;
> > > 	__u32	count;
> > > 	__u32	fds[];
> > > };
> > > 
> > > As above, the user calls VFIO_DEVICE_PCI_HOT_RESET on a VFIO device file
> > > descriptor.  Using the list from PCI_HOT_RESET_INFO, the user allocates
> > > a struct vfio_device_pci_hot_reset of sufficient size to pass a list of
> > > VFIO group file descriptors.  There should be one file descriptor for
> > > each group listed in the info entries.  If the list of groups matches
> > > those affected by a hot reset of the device, then VFIO will perform the
> > > hot reset action and return success.
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html