On Wed, 2011-11-09 at 15:08 -0600, Christian Benvenuti (benve) wrote:
<snip>
> > > > +
> > > > +struct vfio_group {
> > > > +	dev_t			devt;
> > > > +	unsigned int		groupid;
> > >
> > > This groupid is returned by the device_group callback you recently added
> > > with a separate (not yet in tree) IOMMU patch.
> > > Is it correct to say that the scope of this ID is the bus the iommu
> > > belongs to (but you use it as if it was global)?
> > > I believe there is nothing right now to ensure the uniqueness of such
> > > an ID across bus types (assuming there will be other bus drivers in the
> > > future besides vfio-pci).
> > > If that's the case, the vfio.group_list global list and the
> > > __vfio_lookup_dev routine should be changed to account for the bus too?
> > > Oops, I just saw the error msg in vfio_group_add_dev about the group
> > > id conflict.
> > > Is that warning related to what I mentioned above?
> >
> > Yeah, this is a concern, but I can't think of a system where we would
> > manifest a collision.  The IOMMU driver is expected to provide unique
> > groupids for all devices below them, but we could imagine a system that
> > implements two different bus_types, each with a different IOMMU driver,
> > and we have no coordination between them.  Perhaps since we have
> > iommu_ops per bus, we should also expose the bus in the vfio group path,
> > ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..).  This
> > means userspace would need to do a readlink of the subsystem entry where
> > it finds the iommu_group to find the vfio group.  Reasonable?
>
> Most probably we won't see use cases with multiple buses anytime soon, but
> this scheme you proposed (with the per-bus subdir) looks good to me.

Ok, I think that's easier than any scheme of trying to organize globally
unique groupids instead of just bus_type unique.  That makes group
objects internally matched by the {groupid, bus} pair.
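
To make sure we're thinking of the same thing, the userspace lookup for
that layout would be something like the rough sketch below.  Where the
groupid actually shows up in sysfs, and the "iommu_group" attribute name
used here, are assumptions for illustration rather than settled ABI;
only the subsystem readlink part follows what I described above.

	#include <stdio.h>
	#include <unistd.h>
	#include <libgen.h>
	#include <limits.h>

	/* Build "/dev/vfio/<bus>/<groupid>" for a device's sysfs dir. */
	static int vfio_group_path(const char *sysfs_dev, char *out, size_t len)
	{
		char link[PATH_MAX], target[PATH_MAX], attr[PATH_MAX];
		unsigned int groupid;
		ssize_t ret;
		FILE *f;

		/* readlink the subsystem entry to get the bus name, ex. "pci" */
		snprintf(link, sizeof(link), "%s/subsystem", sysfs_dev);
		ret = readlink(link, target, sizeof(target) - 1);
		if (ret < 0)
			return -1;
		target[ret] = '\0';

		/* groupid published for the device (attribute name assumed) */
		snprintf(attr, sizeof(attr), "%s/iommu_group", sysfs_dev);
		f = fopen(attr, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%u", &groupid) != 1) {
			fclose(f);
			return -1;
		}
		fclose(f);

		snprintf(out, len, "/dev/vfio/%s/%u", basename(target), groupid);
		return 0;
	}

So vfio_group_path("/sys/bus/pci/devices/0000:01:00.0", path, sizeof(path))
would hand back /dev/vfio/pci/<groupid> with the layout above.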
<snip>
> > >
> > > I looked at how you take care of ref counts ...
> > >
> > > This is how the tree of vfio_iommu/vfio_group/vfio_device data
> > > structures is organized (I'll use just iommu/group/dev to make
> > > the graph smaller):
> > >
> > >             iommu
> > >            /     \
> > >           /       \
> > >       group  ...  group
> > >       /  \         /  \
> > >      /    \       /    \
> > >   dev .. dev   dev .. dev
> > >
> > > This is how you get a file descriptor for the three kinds of objects:
> > >
> > > - group : open /dev/vfio/xxx for group xxx
> > > - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
> > > - device: group ioctl VFIO_GROUP_GET_DEVICE_FD
> > >
> > > Given the above topology, I would assume that:
> > >
> > > (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
> > >                              b) any of its groups is 'inuse'
> > >
> > > (2) a group is 'inuse' if : a) group refcnt > 0, or
> > >                             b) any of its devices is 'inuse'
> > >
> > > (3) a device is 'inuse' if : a) device refcnt > 0
> >
> > (2) is a bit debatable.  I've wrestled with this one for a while.  The
> > vfio_iommu serves two purposes.  First, it is the object we use for
> > managing iommu domains, which includes allocating domains and attaching
> > devices to domains.  Group objects aren't involved here; they just
> > manage the set of devices.  The second role is to manage merged groups,
> > because whether or not groups can be merged is a function of iommu
> > domain compatibility.
> >
> > So if we look at "is the iommu in use?", ie. can I destroy the mapping
> > context, detach devices and free the domain, the reference count on the
> > group is irrelevant.  The user has to have a device or iommu file
> > descriptor opened somewhere, across the group or merged group, for that
> > context to be maintained.  A reasonable requirement, I think.
>
> OK, then if you close all devices and the iommu, keeping the group open
> would not protect the iommu domain mapping.  This means that if you (or
> a management application) need to close all devices+iommu and reopen
> right away again the same devices+iommu, you may get a failure on the
> iommu domain creation (supposing the system goes out of resources).
> Is this just a very unlikely scenario?

Can you think of a use case that would require such?  I can't.

> I guess in this case you would simply have to avoid releasing the iommu
> fd, right?

Right.  We could also debate whether we should drop all iommu mappings
when the iommu refcnt goes to zero.  We don't currently do that, but it
might make sense.

> > However, if we ask "is the group in use?", ie. can I not only destroy
> > the mappings above, but also automatically tear apart merged groups,
> > then I think we need to look at the group refcnt.
>
> Correct.

> > There's also a symmetry factor, the group is a benign entry point to
> > device access.  It's only when device or iommu access is granted that
> > the group gains any real power.  Therefore, shouldn't that power also be
> > removed when those access points are closed?
> >
> > > You have coded the 'inuse' logic with these three routines:
> > >
> > >   __vfio_iommu_inuse, which implements (1) above
> > >
> > > and
> > >   __vfio_iommu_groups_inuse
> >
> > Implements (2.a)
>
> Yes, but for all groups at once.

Right

> > >   __vfio_group_devs_inuse
> >
> > Implements (2.b)
>
> Yes

> > > which are used by __vfio_iommu_inuse.
> > > Why don't you check the group refcnt in __vfio_iommu_groups_inuse?
> >
> > Hopefully explained above, but open for discussion.
> >
> > > Would it make sense (and the code more readable) to structure the
> > > nested refcnt/inuse check like this?
> > > (The numbers (1)(2)(3) refer to the three 'inuse' conditions above)
> > >
> > > (1)__vfio_iommu_inuse
> > >     |
> > >     +-> check iommu refcnt
> > >     +-> __vfio_iommu_groups_inuse
> > >         |
> > >         +-> LOOP: (2)__vfio_iommu_group_inuse        <-- MISSING
> > >                 |
> > >                 +-> check group refcnt               <-- MISSING
> > >                 +-> __vfio_group_devs_inuse()
> > >                     |
> > >                     +-> LOOP: (3)__vfio_group_dev_inuse  <-- MISSING
> > >                             |
> > >                             +-> check device refcnt
> >
> > We currently do:
> >
> > (1)__vfio_iommu_inuse
> >     |
> >     +-> check iommu refcnt
> >     +-> __vfio_group_devs_inuse
> >         |
> >         +-> LOOP: (2.b)__vfio_group_devs_inuse
> >                 |
> >                 +-> LOOP: (3) check device refcnt
> >
> > If that passes, the iommu context can be dissolved and we follow up
> > with:
> >
> > __vfio_iommu_groups_inuse
> >     |
> >     +-> LOOP: (2.a)__vfio_iommu_groups_inuse
> >             |
> >             +-> check group refcnt
> >
> > If that passes, groups can also be unmerged.
> >
> > Is this right?
>
> Yes, assuming we stick to the "benign" role of groups you
> described above.

Ok, no change then.  Thanks for looking at that so closely.
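
For the archives, the two passes we just agreed on look roughly like the
sketch below.  This is a paraphrase, not the exact patch code: the
struct layouts are trimmed and the list/field names are illustrative.

	#include <linux/list.h>
	#include <linux/types.h>

	struct vfio_device {
		struct list_head	device_next;
		int			refcnt;
	};

	struct vfio_group {
		struct list_head	group_next;
		struct list_head	device_list;
		int			refcnt;
	};

	struct vfio_iommu {
		struct list_head	group_list;
		int			refcnt;
	};

	/* (2.b)/(3): is any device fd in the group held open? */
	static bool __vfio_group_devs_inuse(struct vfio_group *group)
	{
		struct vfio_device *device;

		list_for_each_entry(device, &group->device_list, device_next) {
			if (device->refcnt)
				return true;
		}
		return false;
	}

	/* (1): iommu fd open, or any device fd open across member groups */
	static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
	{
		struct vfio_group *group;

		if (iommu->refcnt)
			return true;

		list_for_each_entry(group, &iommu->group_list, group_next) {
			if (__vfio_group_devs_inuse(group))
				return true;
		}
		return false;
	}

	/* (2.a): checked separately, only when deciding whether to unmerge */
	static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
	{
		struct vfio_group *group;

		list_for_each_entry(group, &iommu->group_list, group_next) {
			if (group->refcnt)
				return true;
		}
		return false;
	}

The group refcnt deliberately only shows up in the second pass, which is
the "benign entry point" behavior discussed above.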
<snip>
> > > > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > >
> > > The documentation in vfio.txt explains clearly the logic implemented
> > > by the merge/unmerge group ioctls.
> > > However, what you are doing is not merging groups, but rather
> > > adding/removing groups to/from iommus (and creating flat lists of
> > > groups).
> > > For example, when you do
> > >
> > >     merge(A,B)
> > >
> > > you actually mean to say "merge B to the list of groups assigned to
> > > the same iommu as group A".
> >
> > It's actually a little more than that.  After you've merged B into A,
> > you can close the file descriptor for B and access all of the devices
> > for the merged group from A.
>
> It is actually more...
>
> Scenario 1:
>
> create_grp(A)
> create_grp(B)
> ...
> merge_grp(A,B)
> create_grp(C)
> merge_grp(C,B) ... this works, right?

No, but merge_grp(B,C) does.  I currently require that the incoming
group has no open device or iommu file descriptors and is a singular
group.  The device/iommu is a hard requirement since we'll be changing
the iommu context and can't leave an attack window.  The singular group
is an implementation detail.  Given the iommu/device requirement, it's
just as easy for userspace to tear apart the group and pass each
individually.

> Scenario 2:
>
> create_grp(A)
> create_grp(B)
> fd_x = get_dev_fd(B,x)
> ...
> merge_grp(A,B)

NAK, fails the no-open-device test.  Again, merge_grp(B,A) is supported.

> create_grp(C)
> merge_grp(A,C)

Yep, this works.

> fd_x = get_dev_fd(C,x)

Yep, and if x is the same in both cases, you'll get 2 different file
descriptors backed by the same device.

> Those two examples seem to suggest more of a list abstraction than a
> merge abstraction.
> However, if it fits into the agreed syntax/logic it is ok, as long as
> we document it properly.

Can you suggest documentation changes that would make this more clear?

> > > For the same reason, you do not really need to provide the group you
> > > want to unmerge from, which means that instead of
> > >
> > >     unmerge(A,B)
> > >
> > > you would just need
> > >
> > >     unmerge(B)
> >
> > Good point, we can avoid the awkward reference via file descriptor for
> > the unmerge.
> >
> > > I understand the reason why it is not a real merge/unmerge (ie, to
> > > keep the original groups so that you can unmerge later)
> >
> > Right, we still need to have visibility of the groups comprising the
> > merged group, but the abstraction provided to the user seems to be
> > deeper than you're thinking.
> >
> > > ... however I just wonder if it wouldn't be more natural to
> > > implement the VFIO_IOMMU_ADD_GROUP/DEL_GROUP iommu ioctls instead?
> > > (the relationships between the data structures would remain the same)
> > > I guess you already discarded this option for some reason, right?
> > > What was the reason?
> >
> > It's a possibility, I'm not sure it was discussed or really what
> > advantage it provides.  It seems like we'd logically lose the ability
> > to access devices from other groups,
>
> What is the real (immediate) benefit of this capability?

Mostly convenience, but also promotes the peer idea where merged groups
simply create a "super" group that can access the iommu and all the
devices of the member groups.  On x86 we expect that merging groups will
always succeed and groups will typically have a single device, so a
driver could merge them all together, throw away all the extra group
file descriptors and manage the whole super group via a single group fd.

> > whether that's good or bad, I don't know.  I think the notion of
> > "merge" promotes the idea that the groups are peers and an
> > iommu_add/del feels a bit more hierarchical.
>
> I agree.
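
On the documentation front, maybe something like the following flow
would make the constraints clearer in vfio.txt.  It's only a sketch: the
ioctl names are the ones from the patch, but the argument conventions
and device name are simplified/made up for illustration.

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>		/* proposed header from this series */

	int merge_example(void)
	{
		int grp_a, grp_b, iommu, dev;

		grp_a = open("/dev/vfio/pci/1", O_RDWR);	/* group A */
		grp_b = open("/dev/vfio/pci/2", O_RDWR);	/* group B */
		if (grp_a < 0 || grp_b < 0)
			return -1;

		/*
		 * Merge B into A.  This only succeeds while B is singular
		 * and has no open device or iommu file descriptors.
		 * (Argument convention simplified; see the header.)
		 */
		if (ioctl(grp_a, VFIO_GROUP_MERGE, &grp_b))
			return -1;

		/* B's fd may now be closed; A speaks for the merged group. */
		close(grp_b);

		/* The iommu and any member device are reachable through A. */
		iommu = ioctl(grp_a, VFIO_GROUP_GET_IOMMU_FD);
		dev = ioctl(grp_a, VFIO_GROUP_GET_DEVICE_FD, "0000:02:00.0");

		return (iommu < 0 || dev < 0) ? -1 : 0;
	}

The key ordering rule it illustrates: merge first, then open device and
iommu descriptors through the surviving group fd.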
<snip>
> > > > +	if (!device) {
> > > > +		if (__vfio_group_devs_inuse(group) ||
> > > > +		    (group->iommu && group->iommu->refcnt)) {
> > > > +			printk(KERN_WARNING
> > > > +			       "Adding device %s to group %u while group is already in use!!\n",
> > > > +			       dev_name(dev), group->groupid);
> > > > +			/* XXX How to prevent other drivers from claiming? */
> > >
> > > Here we are adding a device (not yet assigned to a vfio bus) to a
> > > group that is already in use.
> > > Given that it would not be acceptable for this device to get assigned
> > > to a non-vfio driver, why not force such assignment here then?
> >
> > Exactly, I just don't know the mechanics of how to make that happen and
> > was hoping for suggestions...
> >
> > > I am not sure though what the best way to do it would be.
> > > What about something like this:
> > >
> > > - when the vfio-pci bus driver processes the BUS_NOTIFY_ADD_DEVICE
> > >   notification it assigns to the device a PCI ID that will make sure
> > >   the vfio-pci's probe routine will be invoked (and no other driver
> > >   can therefore claim the device).  That PCI ID would have to be
> > >   added to the vfio_pci_driver's id_table (it would be the exception
> > >   to the "only dynamic IDs" rule).  Too hackish?
> >
> > Presumably some other driver also has the ID in its id_table, how do
> > we make sure we win?
>
> By mangling such ID (when processing the BUS_NOTIFY_ADD_DEVICE notification)
> to match against a 'fake' ID registered in the vfio-pci table (it would be
> like a sort of driver redirect/divert).  The vfio-pci probe routine would
> restore the original ID (we do not want to confuse userspace).  This is
> hackish, I agree.
>
> What about this:
> - When vfio-pci processes the BUS_NOTIFY_ADD_DEVICE notification it can
>   pre-initialize the driver pointer (via an API).  We would then need to
>   change the match/probe PCI mechanism too: for example, the PCI core will
>   have to check and honor such pre-driver-initialization when present (and
>   give it higher priority over the match callbacks).
>   How to do this?  For example, when vfio_group_add_dev is invoked, it
>   checks whether the device is getting added to an already existent group
>   where the other devices (well, you would need to check just one of the
>   devices in the group) are already assigned to vfio-pci, and in such a
>   case it pre-initializes the driver to vfio-pci.

It's ok to make a group "non-viable", we only want to intervene if the
iommu is in use (iommu or device refcnt > 0).

> NOTE: By "preinit" I mean "save into the device a reference to a driver
>       before the 'match' callbacks".
>
> This would be the timeline:
>
>  |
>  +-> new device gets added to (PCI) bus
>  |
>  +-> PCI: send BUS_NOTIFY_ADD_DEVICE notification
>      |
>      +-> VFIO: vfio_pci_device_notifier
>      |    |
>      |    +-> BUS_NOTIFY_ADD_DEVICE: vfio_group_add_dev
>      |         |
>      |         +-> iommu_device_group(dev, &groupid)
>      |         +-> group = <search groupid in vfio.group_list>
>      |         +-> if (group && group_is_vfio(group))
>      |         |       <preinit device driver to vfio-pci>
>      |         ...
>      |
>      +-> PCI: xxx
>      |    |
>      |    +-> if (!device_driver_is_preinit(dev))
>      |    |       probe = <search driver's probe callback using 'match'>
>      |    |   else
>      |    |       probe = <get it from preinit driver config>
>      |    |       (+ fallback to 'match' if preinit driver disappeared?)
>      |    |
>      |    +-> rc = probe(...)
>      |    |
>      |    ...
>      v
>     ...
>
> Of course, what if multiple drivers decide to preinit the device?

Yep, we'd have to have a policy to BUG_ON if the preinit driver is
already set.
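
The notifier side of that would have roughly the shape below.  This is
only a sketch, not the patch code: __vfio_lookup_group(),
group_is_vfio() and vfio_preinit_driver() are placeholders for whatever
lookup/claim mechanism we settle on; only the notifier plumbing and
iommu_device_group() (from the pending IOMMU patch) are real interfaces.

	#include <linux/device.h>
	#include <linux/notifier.h>
	#include <linux/iommu.h>

	struct vfio_group;

	/* Placeholders, named for illustration only. */
	extern struct vfio_group *__vfio_lookup_group(unsigned int groupid);
	extern bool group_is_vfio(struct vfio_group *group);
	extern void vfio_preinit_driver(struct device *dev);

	static int vfio_pci_device_notifier(struct notifier_block *nb,
					    unsigned long action, void *data)
	{
		struct device *dev = data;
		struct vfio_group *group;
		unsigned int groupid;

		if (action != BUS_NOTIFY_ADD_DEVICE)
			return NOTIFY_DONE;

		/* Not behind an iommu: nothing for vfio to track or claim. */
		if (iommu_device_group(dev, &groupid))
			return NOTIFY_DONE;

		group = __vfio_lookup_group(groupid);
		if (group && group_is_vfio(group)) {
			/* Policy: only one driver may preinit a device. */
			BUG_ON(dev->driver);
			vfio_preinit_driver(dev);
		}

		return NOTIFY_OK;
	}

	static struct notifier_block vfio_pci_nb = {
		.notifier_call = vfio_pci_device_notifier,
	};
	/* registered via bus_register_notifier(&pci_bus_type, &vfio_pci_nb) */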
> One way to make it cleaner would be to:
> - have the PCI layer export an API that allows (for example) the bus
>   notification callbacks (like vfio_pci_device_notifier) to preinit a
>   driver
> - make such API reject calls on devices that already have a preinit
>   driver.
> - make VFIO detect the case where vfio_pci_device_notifier can not
>   preinit the driver (to vfio-pci) for the new device (because already
>   preinited) and raise an error/warning.
>
> Would this look a bit cleaner?

It looks like there might already be infrastructure such that we can set
dev->driver and call the driver probe() function, so maybe we're only in
trouble if dev->driver is already set when we get the bus add
notification.  I just wasn't sure if that was entirely kosher.  I'll
have to try that and figure out how to test it; fake hotplug maybe.

<snip>
> > > This fn below does not return any error code.  Ok ...
> > > However, there are a number of error cases that you test, for example
> > > - device that does not belong to any group (according to iommu API)
> > > - device that belongs to a group but that does not appear in the list
> > >   of devices of the vfio_group structure.
> > > Are the above two error checks just paranoia or are those errors
> > > actually possible?
> > > If they were possible, shouldn't we generate a warning (most probably
> > > it would be a bug in the code)?
> >
> > They're all vfio-bus driver bugs of some sort, so it's just a matter of
> > how much we want to scream about them.  I'll comment on each below.
> >
> > > > +void vfio_group_del_dev(struct device *dev)
> > > > +{
> > > > +	struct list_head *pos;
> > > > +	struct vfio_group *group = NULL;
> > > > +	struct vfio_device *device = NULL;
> > > > +	unsigned int groupid;
> > > > +
> > > > +	if (iommu_device_group(dev, &groupid))
> > > > +		return;
> >
> > Here the bus driver is probably just sitting on a notifier list for
> > their bus_type and a device is getting removed.  Unless we want to
> > require the bus driver to track everything it's attempted to add and
> > whether it worked, we can just ignore this.
>
> OK, I see what you mean.  If vfio_group_add_dev fails for some reason we
> do not keep track of it.  Right?

The primary thing I'm thinking of here is not vfio_group_add_dev()
failing for "some reason", but specifically failing because the device
doesn't have a groupid, ie. it's not behind an iommu.  In that case it's
just a random device that can't be used by vfio.

> Would it make sense to add one special group to vfio.group_list (or
> better, on a separate field of the vfio structure) whose goal would be
> just that: keep track of those devices that failed to be added to the
> VFIO framework (can it help for debugging too?)?

For the above case, no, we shouldn't need to track those.  But it does
seem like there's a gap for devices that fail vfio_group_add_dev() for
other reasons.  I don't think we want a special group for them, because
that isolates them from other devices that are potentially in the same
group.  I think instead what we want to do is set a taint flag on the
group.  We can do a BUG_ON not being able to allocate a group, then a
WARN_ON if we fail elsewhere and mark the group tainted so it's
effectively never viable.
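
Roughly what I mean by tainting is sketched below.  The "tainted" field
and the helper names are hypothetical (not in the current patch), and
the struct is trimmed to the fields the sketch touches; it's only meant
to show where the WARN and the viability check would land.

	#include <linux/list.h>
	#include <linux/bug.h>
	#include <linux/types.h>

	struct vfio_group {
		struct list_head	device_list;
		unsigned int		groupid;
		bool			tainted;	/* hypothetical */
	};

	/* Called from the add-device path if we fail after the group exists. */
	static void vfio_group_taint(struct vfio_group *group)
	{
		WARN(1, "vfio: failed to track device in group %u, group disabled\n",
		     group->groupid);
		group->tainted = true;
	}

	/* A tainted group can never become viable, so it can't be used. */
	static bool vfio_group_viable(struct vfio_group *group)
	{
		if (group->tainted)
			return false;

		/* ... plus the existing per-device checks (every device in
		 * the group bound to a vfio bus driver, etc.), omitted here. */
		return !list_empty(&group->device_list);
	}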
<snip>
> > > > +	if (!vfio_device_removeable(device)) {
> > > > +		/* XXX signal for all devices in group to be removed or
> > > > +		 * resort to killing the process holding the device fds.
> > > > +		 * For now just block waiting for releases to wake us. */
> > > > +		wait_event(vfio.release_q, vfio_device_removeable(device));
> > >
> > > Any new idea/proposal on how to handle this situation?
> > > The last one I remember was to leave the soft/hard/etc timeout
> > > handling in userspace and implement it as a sort of policy.  Is that
> > > one still the most likely candidate solution to handle this situation?
> >
> > I haven't heard any new proposals.  I think we need the hard timeout
> > handling in the kernel.  We can't leave it to userspace to decide they
> > get to keep the device.  We could have this tunable via an ioctl, but I
> > don't see how we wouldn't require CAP_SYS_ADMIN (or similar) to tweak
> > it.  I was intending to re-implement the netlink interface to signal
> > the removal, but expect to get allergic reactions to that.
>
> (I personally like the async netlink signaling, but I am OK with an
> ioctl based mechanism if it provides the same flexibility)
>
> What would be a reasonable hard timeout?

I think we were looking at 10s of seconds in the old vfio code.  Tough
call though.  Could potentially provide a module_param override so an
admin that trusts their users could set a long/infinite timeout.

Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html