On Fri, 2013-08-02 at 10:36 -0600, Alex Williamson wrote: > > Also in some cases, at least for us, we have a problem where the scope > > of the reset can be larger than the group ... in this case I think we > > need to forbid the reset. > > Interesting, I would have ventured to guess that resets are always > contained within a single group given the PE isolation domains rooted at > a PHB. Depends how I configure them. For example, if I assign two functions to two different PEs (SR-IOV typically), a hot reset will kill both. For fundamental reset, depending on how the mobo is wired up and other platform things, I might for example be limited on doing a PERST at the PHB level, thus accross multiple PEs. > Why disallow it though? What I propose below would still allow > it if the user can prove that they own all of the affected groups. Right, I mean disallow it if the PEs are not all owned by the same user. What you propose would probably work. > Even > on x86, I think it's too restrictive to only allow reset if the affect > is contained within a group. It's just too easy to need it for a > multi-function device. We could hope that a device that implements ACS > also implements some sort of FLR, but that may just be wishful thinking. Right. Another use case is, I know of devices that need a fundamental reset (PERST) after applying a FW update. > > For us, it's also tied into our "EEH" error handling. IE. The way the > > guest will request some of these things is going to be via firmware APIs > > that we have yet to implement, but that we were also thinking of > > implementing entire in the kernel rather than qemu for various > > reasons... IE. There is more to error handling & recovery in our case at > > least than AER and reset :-) > > I've been focused on reset for device re-initialization and > repeatability, but that's a good point, we need to consider error > recovery as a use of this interface as well. Unfortunately, with EEH we have very specific FW interface (including low level diagnostic data "blobs" etc...) that we need to implement so I am not sure we can "fit" it in a generic interface. We might want to attempt to list our interfaces and see. CC'ing Gavin who is our EEH expert. A rough cut is - We have the concept of "frozen" PE (frozen MMIO and frozen DMA, two separate attributes). So we can query and change the freeze state on a PE. - We have various level of error severity (from informational (corrected) to full PHB need a reset with some variations in the middle of PEs being frozen etc...) - We have diag blobs coming out of FW for the guest to log (and possibly partially decode). - We have interfaces for various types of resets > > I need to spend more time reading your proposal and see how it can work > > for us (or not...)... but we might end up just doing our own thing that > > carries the whole EEH API instead. > > Obviously it would be great if it could work for you, but regardless of > whether you intend to use it I'd appreciate feedback. Thanks, Ben. > Alex > > > > --- > > > Mechanism to do PCI hot resets through VFIO: > > > > > > VFIO is fundamentally an IOMMU group and device level interface. > > > There's no concept of buses, slots, or hierarchies of devices. There > > > are only IOMMU group and devices. A bus (or slot) may contain exactly > > > one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group. > > > An IOMMU group may contain one or more devices. > > > > > > The first question is perhaps where should we create the interface to do > > > a PCI hot reset. Assuming an ioctl interface, our choices are the > > > group, the container, or the device file descriptors. Groups and > > > containers are not PCI specific, so an extension on either of those > > > doesn't make much sense. They also don't have much granularity if your > > > goal is to do a hot reset on the smallest subset of devices you can. > > > Therefore the only choice seems to be a VFIO device level interface. > > > > > > The fact that a hot reset affects multiple devices also raises concerns. > > > How do we make sure a user has sufficient access/privilege to perform > > > this operation? If all of the affected devices are within the same > > > group, then we know the user already "owns" all those devices. Using > > > groups as the boundary excludes a number of use cases though. The user > > > would need to prove that they also own the other groups or devices that > > > are affected by the reset. This might be multiple groups, so the ioctl > > > quickly grows to requiring a list of file descriptors be passed for > > > validation. > > > > > > We already use the group file descriptor as a unit of ownership for > > > enabling the container, so it seems like it would make sense to use it > > > here too. The alternative is a device file descriptor, but groups may > > > encompass devices the user doesn't care to use and we don't want to > > > require that they open a file descriptor simply to perform a hot reset. > > > Groups can also contain devices that the user cannot open, for instance > > > those owned by VFIO "compatible" drivers like pci-stub or pcieport. > > > > > > The user also needs to know the set of devices affected by a hot reset, > > > otherwise they have no idea which group file descriptors to pass to such > > > an interface. That implies we also need a separate "info" ioctl for the > > > user to learn that information. We could argue that the user could > > > learn this information from sysfs, but that imposes non-trivial library > > > or code overhead on the user to evaluate the topology. The PCI hot > > > reset info ioctl would need to indicate whether a hot reset is > > > available, and the set of affected devices. It may be useful to provide > > > this as a {group, device} pair so the user doesn't need to > > > cross-reference each device with sysfs to determine the group for the > > > device. This would then provide both the set of groups required to > > > perform the hot reset and the set of devices affected by the hot reset. > > > > > > As an alternative, we could consider simply requiring that all of the > > > devices affected by a hot reset belong to the same VFIO container. > > > However, allowing multiple groups per container is an optional IOMMU > > > capability that really has no relation to PCI bus/slot boundaries. It > > > seems a bit arbitrary to require groups be placed in the same container > > > to get a PCI hot reset. That likely means we'd still need to support > > > passing some kind of ownership token as above with groups. So it > > > doesn't seem to make the situation any better. > > > > > > Given the above discussion, I therefore propose the following PCI hot > > > reset interface: > > > > > > /** > > > * VFIO_DEVICE_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + ??, > > > * struct vfio_device_pci_hot_reset_info) > > > */ > > > > > > #define VFIO_DEVICE_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, VFIO_BASE + ??) > > > > > > struct vfio_device_pci_hot_reset_info_entry { > > > __u32 group_id; > > > __u16 segment; /* A reset will never include devices on other segments... return it anyway */ > > > __u8 bus; > > > __u8 devfn; /* Use PCI_SLOT/PCI_FUNC */ > > > }; > > > > > > struct vfio_device_pci_hot_reset_info { > > > __u32 argsz; > > > __u32 flags; > > > #define VFIO_PCI_HOT_RESET_SUPPORTED (1 << 0) /* Device supports hot reset */ > > > #define VFIO_PCI_HOT_RESET_POPULATED (1 << 1) /* Entries field are populated */ > > > __u32 count; > > > struct vfio_device_pci_hot_reset_info_entry entries[]; > > > }; > > > > > > The user calls VFIO_DEVICE_PCI_HOT_RESET_INFO on a VFIO device file > > > descriptor with a struct vfio_device_pci_hot_reset_info data structure, > > > minimally the sizeof the struct with argsz set. VFIO returns whether > > > hot reset is supported and the number of devices affected by the reset. > > > If argsz is big enough, VFIO will fill in the entries and set the > > > populated flag, otherwise the caller can reallocate the structure and > > > try again. > > > > > > /** > > > * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + ??, > > > * struct vfio_device_pci_hot_reset) > > > */ > > > > > > #define VFIO_DEVICE_PCI_HOT_RESET _IO(VFIO_TYPE, VFIO_BASE + ??) > > > > > > struct vfio_device_pci_hot_reset { > > > __u32 argsz; > > > __u32 flags; > > > __u32 count; > > > __u32 fds[]; > > > }; > > > > > > As above, the user calls VFIO_DEVICE_PCI_HOT_RESET on a VFIO device file > > > descriptor. Using the list from PCI_HOT_RESET_INFO, the user allocates > > > a struct vfio_device_pci_hot_reset of sufficient size to pass a list of > > > VFIO group file descriptors. There should be one file descriptor for > > > each group listed in the info entries. If the list of groups matches > > > those affected by a hot reset of the device, then VFIO will perform the > > > hot reset action and return success. > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe kvm" in > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html