[Appears the list got dropped, replying to my previous message to re-add] On Tue, 11 Apr 2023 13:32:16 -0300 Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > On Tue, Apr 11, 2023 at 09:54:17AM -0600, Alex Williamson wrote: > > On Tue, 11 Apr 2023 10:24:58 -0300 > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > > > On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote: > > > > > > > Where whether a device is opened is subject to change outside of the > > > > user's control. This essentially allows the user to perform hot-resets > > > > of devices outside of their ownership so long as the device is not > > > > used elsewhere, versus the current requirement that the user own all the > > > > affected groups, which implies device ownership. It's not been > > > > justified why this feature needs to exist, imo. > > > > > > The cdev API doesn't have the notion that owning a group means you > > > "own" some collection of devices. It still happens as a side effect, > > > but it isn't obviously part of the API. I'm really loath to > > > re-introduce that group-based concept just for this. We are trying > > > reduce the group API surface. > > > > > > How about a different direction. > > > > > > We add a new uAPI for cdev mode that is "take ownership of the reset > > > group". Maybe it can be a flag in during bind. > > > > > > When requested vfio will ensure that every device in the reset group > > > is only bound to this iommufd_ctx or left closed. Now and in the > > > future. Since no-iommu has no iommufd_ctx this means we can open only > > > one device in the reset group. > > > > > > With this flag RESET is guaranteed to always work by definition. > > > > > > We continue with the zero-length FD, but we can just replace the > > > security checks with a check if we are in reset group ownership mode. > > > > > > _INFO is unchanged. > > > > > > We decide if we add a new IOCTL to return the BDF so the existing > > > _INFO can get back to the dev_id or a new IOCTL that returns the > > > dev_id list of the reset group. > > > > > > Userspace is required to figure out the extent of the reset, but we > > > don't require that userspace prove to the kernel it did this when > > > requesting the reset. > > > > Take for example a multi-function PCIe device with ACS isolation between > > functions, are you going to allow a user who has only been granted > > ownership of a subset of functions control of the entire dev_set? > > Our cdev model says that opening a cdev locks out other cdevs from > independent use, eg because of the group sharing. Extending this to > include the reset group as well seems consistent. The DMA ownership model based on the IOMMU group is consistent with legacy vfio, but now you're proposing a new ownership model that optionally allows a user to extend their ownership, opportunistically lock out other users, and wreaking havoc for management utilities that also have no insight into dev_sets or userspace driver behavior. > There is some security concern here, but that goes both ways, a 3rd > party should not be able to break an application that needs to use > this RESET and had sufficient privileges to assert an ownership. There are clearly scenarios we have now that could break. For example, today if QEMU doesn't own all the IOMMU groups for a mult-function device, it can't do a reset, the remaining functions are available for other users. As I understand the proposal, QEMU now gets to attempt to claim ownership of the dev_set, so it opportunistically extends its ownership and may block other users from the affected devices. Ordering makes this effectively unpredictable, if a userspace like DPDK that doesn't assert dev_set ownership is started first, QEMU can start and be denied hot-reset support. In the reverse ordering, the DPDK application can be locked out by QEMU. > I'd say anyone should be able to assert RESET ownership if, like > today, the iommufd_ctx has all the groups of the dev_set inside > it. Once asserted it becomes safe against all forms of hotplug, and > continues to be safe even if some of the devices are closed. eg hot > unplugging from the VM doesn't change the availability of RESET. > > This comes from your ask that qemu know clearly if RESET works, and it > doesn't change while qemu is running. This seems stronger and clearer > than the current implicit scheme. It also doesn't require usespace to > do any calculations with groups or BDFs to figure out of RESET is > available, kernel confirms it directly. As above, clarity and predictability seem lacking in this proposal. With the current scheme, the ownership of the affected devices is implied if they exist within an owned group, but the strength of that ownership is clear. Affected devices outside the set of owned groups says that hot-reset is unavailable without any of this "but QEMU might be able to request it" or "unless the affected device is currently unopened" variables. > > seems this proposal essentially extends the ownership model to the > > greater of the dev_set or iommu group, apparently neither of which > > are explicitly exposed to the user in the cdev API. > > IIRC the group id can be learned from sysfs before opening the cdev > file. Something like /sys/class/vfio/XX/../../iommu_group And in the passed cdev fd model... ? > We should also have an iommufd ioctl to report the "same ioas" > groupings of dev_ids to make it easy on userspace. I haven't checked > to see what the current qemu patches are doing with this.. Seems we're ignoring that no-iommu doesn't have a valid iommufd. > > How does a user determine when devices cannot be used independently > > in the cdev API? > > We have this problem right now. The only way to learn the reset group > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group" > under /sys/class/vfio/XX/ if something needs it earlier. For all the complaints about complexity, now we're asking management tools to not only take into account IOMMU groups, but also reset groups, and some inferred knowledge about the application and devices to speculate whether reset group ownership is taken by a given userspace?? Thanks, Alex