On Wed, 5 Apr 2023 14:23:43 -0300 Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote: > > On Wed, 5 Apr 2023 13:37:05 -0300 > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote: > > > > > > > But that kind of brings to light the question of what does the user do > > > > when they encounter this situation. > > > > > > What does it do now when it encounters a group_id it doesn't > > > understand? Userspace already doesn't know if the foreign group is > > > open or not, right? > > > > It's simple, there is currently no screwiness around opened devices. > > If the caller doesn't own all the groups mapping to the affected > > devices, hot-reset is not available. > > That still has nasty edge cases. If the reset group spans beyond a > single iommu group you end up with qemu being unable to operate reset > at all, and it is unfixable from an API perspective as we can't pass > in groups that VFIO isn't going to use. Hmm, s/nasty/niche/? Yes, QEMU currently has no way to own a group without assigning a device from the group, but technically that could be fixed within QEMU. If QEMU doesn't own that affected group, then it can't very well count on that group to not be used in some other way when it comes time to actually do a hot-reset. > I think you are right, the fact we'd have to return -1 dev_ids to this > modified API is pretty damaging, it doesn't seem like a good > direction. > > > This leads to scenarios where the info ioctl indicates a hot-reset is > > initially available, perhaps only because one of the affected devices > > was not opened at the time, and now it fails when QEMU actually tries > > to use it. > > I would like it if the APIs toward the kernel were only about the > kernel's security apparatus. It is makes it easier to reason about the > kernel side and gives nice simple well defined APIs. Usability needs to be a consideration as well. An interface where the result is effectively arbitrary from a user perspective because the kernel is solely focused on whether the operation is allowed, evaluating constraints that the user is unaware of and cannot control, is unusable. > This is a good point that qemu needs to make a policy decision if it > is happy about the VFIO configuration - but that is a policy decision > that should not become entangled with the kernel's security checks. > > Today qemu can make this policy choice the same way it does right now > - call _INFO and check the group_ids. It gets the exact same outcome > as today. We already discussed that we need to expose the group ID > through an ioctl someplace. QEMU can make a policy decision today because the kernel provides a sufficiently reliable interface, ie. based on the set of owned groups, a hot-reset is all but guaranteed to work. If we focus only on whether a given reset is allowed from a kernel perspective and ignore that userspace needs some predictability of the kernel behavior, then QEMU cannot reasonable make that policy decision. > If this is too awkward we could add a query to the kernel if the cdev > is "reset exclusive" - eg the iommufd covers all the groups that span > the reset set. That's essentially what we have if there are valid dev-ids for each affected device in the info ioctl. I don't think it helps the user experience to create loopholes where the hot-reset ioctl can still work in spite of those missing devices. The group interface uses the fact that ownership of the group implies ownership of all devices within the group such that the user only needs to prove group ownership. But we still have underlying groups even with the cdev model, with the same ownership principles, so don't we just need to prove group ownership based on a device fd rather than a group fd? For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports capability chains, we could add a capability that reports the group ID for the device. The hot-reset info ioctl remains as it is today, reporting group-ids and bdfs. The hot-reset ioctl itself is modified to transparently support either group fds or device fds. The user can now map cdevs to group-ids and therefore follow the same rules as groups, providing at least one representative device fd for each group. We've essentially already enabled this by allowing the limit of user provided fds equal to the number of affected devices. Does that work? Thanks, Alex