On Wed, 5 Apr 2023 12:56:21 -0600 Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > On Wed, 5 Apr 2023 14:23:43 -0300 > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote: > > > On Wed, 5 Apr 2023 13:37:05 -0300 > > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > > > > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote: > > > > > > > > > But that kind of brings to light the question of what does the user do > > > > > when they encounter this situation. > > > > > > > > What does it do now when it encounters a group_id it doesn't > > > > understand? Userspace already doesn't know if the foreign group is > > > > open or not, right? > > > > > > It's simple, there is currently no screwiness around opened devices. > > > If the caller doesn't own all the groups mapping to the affected > > > devices, hot-reset is not available. > > > > That still has nasty edge cases. If the reset group spans beyond a > > single iommu group you end up with qemu being unable to operate reset > > at all, and it is unfixable from an API perspective as we can't pass > > in groups that VFIO isn't going to use. > > Hmm, s/nasty/niche/? Yes, QEMU currently has no way to own a group > without assigning a device from the group, but technically that could > be fixed within QEMU. If QEMU doesn't own that affected group, then it > can't very well count on that group to not be used in some other way > when it comes time to actually do a hot-reset. > > > I think you are right, the fact we'd have to return -1 dev_ids to this > > modified API is pretty damaging, it doesn't seem like a good > > direction. > > > > > This leads to scenarios where the info ioctl indicates a hot-reset is > > > initially available, perhaps only because one of the affected devices > > > was not opened at the time, and now it fails when QEMU actually tries > > > to use it. > > > > I would like it if the APIs toward the kernel were only about the > > kernel's security apparatus. It is makes it easier to reason about the > > kernel side and gives nice simple well defined APIs. > > Usability needs to be a consideration as well. An interface where the > result is effectively arbitrary from a user perspective because the > kernel is solely focused on whether the operation is allowed, > evaluating constraints that the user is unaware of and cannot control, > is unusable. > > > This is a good point that qemu needs to make a policy decision if it > > is happy about the VFIO configuration - but that is a policy decision > > that should not become entangled with the kernel's security checks. > > > > Today qemu can make this policy choice the same way it does right now > > - call _INFO and check the group_ids. It gets the exact same outcome > > as today. We already discussed that we need to expose the group ID > > through an ioctl someplace. > > QEMU can make a policy decision today because the kernel provides a > sufficiently reliable interface, ie. based on the set of owned groups, a > hot-reset is all but guaranteed to work. If we focus only on whether a > given reset is allowed from a kernel perspective and ignore that > userspace needs some predictability of the kernel behavior, then QEMU > cannot reasonable make that policy decision. > > > If this is too awkward we could add a query to the kernel if the cdev > > is "reset exclusive" - eg the iommufd covers all the groups that span > > the reset set. > > That's essentially what we have if there are valid dev-ids for each > affected device in the info ioctl. I don't think it helps the user > experience to create loopholes where the hot-reset ioctl can still work > in spite of those missing devices. The group interface uses the fact > that ownership of the group implies ownership of all devices within the > group such that the user only needs to prove group ownership. > > But we still have underlying groups even with the cdev model, with the > same ownership principles, so don't we just need to prove group > ownership based on a device fd rather than a group fd? > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports > capability chains, we could add a capability that reports the group ID > for the device. The hot-reset info ioctl remains as it is today, > reporting group-ids and bdfs. The hot-reset ioctl itself is modified to > transparently support either group fds or device fds. The user can now > map cdevs to group-ids and therefore follow the same rules as groups, > providing at least one representative device fd for each group. We've > essentially already enabled this by allowing the limit of user provided > fds equal to the number of affected devices. If I'm not mistaken, I think this resolves cdev no-iommu to work equivalently to groups as well. Thanks, Alex