On Tue, 2012-03-27 at 16:14 +1100, David Gibson wrote: > On Wed, Mar 21, 2012 at 03:12:58PM -0600, Alex Williamson wrote: > > On Sat, 2012-03-17 at 15:57 +1100, David Gibson wrote: > > > On Fri, Mar 16, 2012 at 01:31:18PM -0600, Alex Williamson wrote: > > > > On Fri, 2012-03-16 at 14:45 +1100, David Gibson wrote: > > > > > On Thu, Mar 15, 2012 at 02:15:01PM -0600, Alex Williamson wrote: > > > > > > On Wed, 2012-03-14 at 20:58 +1100, David Gibson wrote: > > > > > > > On Tue, Mar 13, 2012 at 10:49:47AM -0600, Alex Williamson wrote: > > > > > > > > On Wed, 2012-03-14 at 01:33 +1100, David Gibson wrote: > > > > > > > > > On Mon, Mar 12, 2012 at 04:32:54PM -0600, Alex Williamson wrote: > > > > > > > > > > +/* > > > > > > > > > > + * Add a device to an isolation group. Isolation groups start empty and > > > > > > > > > > + * must be told about the devices they contain. Expect this to be called > > > > > > > > > > + * from isolation group providers via notifier. > > > > > > > > > > + */ > > > > > > > > > > > > > > > > > > Doesn't necessarily have to be from a notifier, particularly if the > > > > > > > > > provider is integrated into host bridge code. > > > > > > > > > > > > > > > > Sure, a provider could do this on it's own if it wants. This just > > > > > > > > provides some infrastructure for a common path. Also note that this > > > > > > > > helps to eliminate all the #ifdef CONFIG_ISOLATION in the provider. Yet > > > > > > > > to be seen whether that can reasonably be the case once isolation groups > > > > > > > > are added to streaming DMA paths. > > > > > > > > > > > > > > Right, but other than the #ifdef safety, which could be achieved more > > > > > > > simply, I'm not seeing what benefit the infrastructure provides over > > > > > > > directly calling the bus notifier function. The infrastructure groups > > > > > > > the notifiers by bus type internally, but AFAICT exactly one bus > > > > > > > notifier call would become exactly one isolation notifier call, and > > > > > > > the notifier callback itself would be almost identical. > > > > > > > > > > > > I guess I don't see this as a fundamental design point of the proposal, > > > > > > it's just a convenient way to initialize groups as a side-band addition > > > > > > until isolation groups become a more fundamental part of the iommu > > > > > > infrastructure. If you want to do that level of integration in your > > > > > > provider, do it and make the callbacks w/o the notifier. If nobody ends > > > > > > up using them, we'll remove them. Maybe it will just end up being a > > > > > > bootstrap. In the typical case, yes, one bus notifier is one isolation > > > > > > notifier. It does however also allow one bus notifier to become > > > > > > multiple isolation notifiers, and includes some filtering that would > > > > > > just be duplicated if every provider decided to implement their own bus > > > > > > notifier. > > > > > > > > > > Uh.. I didn't notice any filtering? That's why I'm asking. > > > > > > > > Not much, but a little: > > > > > > > > + switch (action) { > > > > + case BUS_NOTIFY_ADD_DEVICE: > > > > + if (!dev->isolation_group) > > > > + blocking_notifier_call_chain(¬ifier->notifier, > > > > + ISOLATION_NOTIFY_ADD_DEVICE, dev); > > > > + break; > > > > + case BUS_NOTIFY_DEL_DEVICE: > > > > + if (dev->isolation_group) > > > > + blocking_notifier_call_chain(¬ifier->notifier, > > > > + ISOLATION_NOTIFY_DEL_DEVICE, dev); > > > > + break; > > > > + } > T> > > > > > > > Ah, I see, fair enough. > > > > > > A couple of tangential observations. First, I suspect using > > > BUS_NOTIFY_DEL_DEVICE is a very roundabout way of handling hot-unplug, > > > it might be better to have an unplug callback in the group instead. > > > > There's really one already. Assuming the device is attached to a group > > driver, the .remove entry point on the driver will get called first. It > > doesn't get to return error, but it can block. > > Hrm. Assuming the isolation provider has installed a driver for the > relevant bus type. And as we're discussing elsewhere, I think we have > situations where the groups will get members on (subordinate) busses > which the isolation provider doesn't have explicit knowledge of. > > > The DEL_DEVICE will only > > happen after the group driver has relinquished the device, or at least > > returned from remove(). For a device with no driver, the group would > > only learn about it through the notifier, but it probably doesn't need > > anything more direct. > > DEL_DEVICE is certainly sufficient, it just seems a bit unnecessarily > awkward. > > > > Second, I don't think aborting the call chain early for hot-plug is > > > actually a good idea. I can't see a clear guarantee on the order, so > > > individual providers couldn't rely on that short-cut behaviour. Which > > > means that if two providers would have attempted to claim the same > > > device, something is seriously wrong and we should probably report > > > that. > > > > Yeah, that seems like a reasonable safety check. > > > > > > > > > > > So, somewhere, I think we need a fallback path, but I'm not sure > > > > > > > > > exactly where. If an isolation provider doesn't explicitly put a > > > > > > > > > device into a group, the device should go into the group of its parent > > > > > > > > > bridge. This covers the case of a bus with IOMMU which has below it a > > > > > > > > > bridge to a different type of DMA capable bus (which the IOMMU isn't > > > > > > > > > explicitly aware of). DMAs from devices on the subordinate bus can be > > > > > > > > > translated by the top-level IOMMU (assuming it sees them as coming > > > > > > > > > from the bridge), but they all need to be treated as one group. > > > > > > > > > > > > > > > > Why would the top level IOMMU provider not set the isolation group in > > > > > > > > this case. > > > > > > > > > > > > > > Because it knows nothing about the subordinate bus. For example > > > > > > > imagine a VT-d system, with a wacky PCI card into which you plug some > > > > > > > other type of DMA capable device. The PCI card is acting as a bridge > > > > > > > from PCI to this, let's call it FooBus. Obviously the VT-d code won't > > > > > > > have a FooBus notifier, since it knows nothing about FooBus. But the > > > > > > > FooBus devices still need to end up in the group of the PCI bridge > > > > > > > device, since their DMA operations will appear as coming from the PCI > > > > > > > bridge card to the IOMMU, and can be isolated from the rest of the > > > > > > > system (but not each other) on that basis. > > > > > > > > > > > > I guess I was imagining that it's ok to have devices without an > > > > > > isolation group. > > > > > > > > > > It is, but having NULL isolation group has a pretty specific meaning - > > > > > it means it's never safe to give that device to userspace, but it also > > > > > means that normal kernel driver operation of that device must not > > > > > interfere with anything in any iso group (otherwise we can never no > > > > > that those iso groups _are_ safe to hand out). Likewise userspace > > > > > operation of any isolation group can't mess with no-group devices. > > > > > > > > This is where wanting to use isolation groups as the working unit for an > > > > iommu ops layer and also wanting to use iommu ops to replace dma ops > > > > seem to collide a bit. Do we want two interfaces for dma, one group > > > > based and one for non-isolated devices? > > > > > > Well, I don't know enough about what dwmw2 had planned to answer > > > that. I was assuming the in-kernel dma reply would remain device/bus > > > focussed, and how it looked up and used the groups was an internal > > > matter. I would expect it would probably need a fallback path for "no > > > group" for transition purposes, even if it wasn't planned to keep that > > > forever. > > > > Hmm, I think the value of a core isolation group layer starts to fall > > apart if an iommu driver can't count on a group being present for any > > device that might do dma. Some ideas below. > > Ah, yes, to clarify: I think anything subject to IOMMU translation > should have a group (which might need a no-manager flag). However, > for devices with no IOMMU, or with an IOMMU we can switch permanently > into bypass mode, I think it's reasonable to leave them without group. > > > > > Isolation providers like > > > > intel-iommu would always use one, non-isolation capable dma paths, like > > > > swiotlb or non-isolation hardware iommus, would use another. And how do > > > > we factor in the degree of isolation to avoid imposing policy in the > > > > kernel? MSI isolation is an example. We should allow userspace to set > > > > a policy of whether lack of MSI protection is an acceptable risk. Does > > > > that mean we can have isolation groups that provide no isolation and > > > > sysfs files indicating capabilities? Perhaps certain classes of groups > > > > don't even allow manager binding? > > > > > > Ugh. I thought about flagging groups with some sort of bitmap of > > > isolation strength properties, but I think that way lies madness. We > > > might want a global policy bitmask though which expresses which > > > constraints are acceptable risks. The providers would then have to > > > provide groups which are as strong as requested, or none at all. > > > > That's effectively how the current iommu_device_group() interface works; > > identify isolate-able groups, or none at all. I don't think we get that > > luxury if we push isolation groups down into the device layer though. > > If we want to use them for dma_ops and to solve various device quirks, > > they can't only be present on some configurations. I think you're right > > on the global policy though, we just need to apply it differently. I'm > > thinking we need something like "isolation_allow=" allowing options of > > "nomsi" and "nop2p" (just for starters). When a provider creates a > > group, they provide a set of flags for the group. A manager is not > > allowed to bind to the group if the global policy doesn't match the > > group flags. We'll need some support functions, maybe even a sysfs > > file, so userspace can know in advance and vfio can avoid creating > > entries for unusable groups. > > Just banning managers for groups of insufficient strength isn't quite > enough, because that doesn't allow for consolidation of several > poorly-isolated groups into one strongly isolated groups which may be > possible on some hardware configurations. This is the part where things start to completely fall apart. > > > > > None of these conditions is true for the hypothetical Foobus case. > > > > > The bus as a whole could be safely given to userspace, the devices on > > > > > it *could* mess with an existing isolation group (suppose the group > > > > > consisted of a PCI segment with the FooBus bridge plus another PCI > > > > > card - FooBus DMAs would be bridged onto the PCI segment and might > > > > > target the other card's MMIO space). And other grouped devices can > > > > > certainly mess with the FooBus devices (if userspace grabs the bridge > > > > > and manipulates its IOMMU mappings, that would clearly screw up any > > > > > kernel drivers doing DMA from FooBus devices behind it). > > > > > > > > ---segment-----+---------------+ > > > > | | > > > > [whackyPCIcard] [other device] > > > > | > > > > +---FooBus---+-------+ > > > > | | > > > > [FooDev1] [FooDev2] > > > > > > > > This is another example of the quality of the isolation group and what > > > > factors we incorporate in judging that. If the bridge/switch port > > > > generating the segment does not support or enable PCI ACS then the IOMMU > > > > may be able to differentiate whackyPCIcard from the other device but not > > > > prevent peer-to-peer traffic between the two (or between FooDevs and the > > > > other device - same difference). This would suggest that we might have > > > > an isolation group at the root of the segment for which the provider can > > > > guarantee isolation (includes everything on and below the bus), and we > > > > might also have isolation groups at whackyPCIcard and the other device > > > > that have a difference quality of isolation. /me pauses for rant about > > > > superior hardware... ;) > > > > > > Nested isolation groups? Please no. > > > > > > An isolation group at the bridge encompassing that segment was exactly > > > what I had in mind, but my point is that all the FooBus devices *also* > > > need to be in that group, even though the isolation provider knows > > > nothing about FooBus. > > > > If we're able to strictly say "no isolation = no group" and ignore > > FooBus for a moment, yes, the group at the bridge makes sense. But as I > > said, I don't think we get that luxury. There will be different > > "acceptable" levels of isolation and we'll need to support both the > > model of single group encompassing the segment as well as separate > > isolation groups for each device, whackyPCICard/other. > > Well, sure. But in that case, the FooBus devices still need to live > in the same group as their bridge card, since they're subject to the > same translations. > > > A question then is how do we support both? Perhaps it's a provider > > option whether to only create fully isolated group, which makes it > > traverse up to the segment. The default might be to make separate > > groups at whackyPCICard and other, each with the nop2p flag. It gets a > > lot more complicated, but maybe we support both so if system policy > > prevents a manager from binding to a sub-group due to nop2p, it can walk > > up a level. I suspect dma_ops always wants the closer group to avoid > > traversing levels. > > Yeah. I really wish I knew more about what dwmw2 had in mind. > > > Back to FooBus, dma_ops only knows how to provide dma for certain types > > of devices. intel-iommu only knows about struct pci_dev. So if FooDev > > needed to do dma, it would need some kind of intermediary to do the > > mapping via whackyPCICard. So from that perspective, we could get away > > w/o including FooBus devices in the group. > > That seems very fragile Logically the FooBus devices should be in the > same group. Yes, the FooBus dma hooks will need to know how to find > the bridge PCI card and thereby manipulate it's IOMMU mappings, but > they still belong in the same isolation group. > > > Of course if we want a > > manager to attach to the group at whackyPCICard (ignoring nop2p), we > > need to put requirements on the state of the FooDevs. I think that > > might actually give some credibility to nested groups, hierarchical > > really, ie if we need to walk all the devices below a group, why not > > allow groups below a group? I don't like it either, but I'm not sure > > it's avoidable. > > Urgh. Nested groups, a group parent must always have a strictly > stronger set of isolation properties than subgroups. I suppose that > could work, it's really setting off my over-engineering alarms, > though. This entire thing is over engineered, I think it's time to take a step back. We're combining iommu visibility with device quirks with isolation strength and hierarchical groups with manager locking and driver autoprobing... it's all over the place. > > > > > Oh.. and I just thought of an existing-in-the-field case with the same > > > > > problem. I'm pretty sure for devices that can appear on PCI but also > > > > > have "dumb-bus" versions used on embedded systems, at least some of > > > > > the kernel drivers are structured so that there is a separate struct > > > > > device for the PCI "wrapper" and the device inside. If the inner > > > > > driver is initiating DMA, as it usually would, it has to be in the > > > > > same iso group as it's PCI device parent. > > > > > > > > I don't know that we can ever generically identify such devices, but > > > > maybe this introduces the idea that we need to be able to blacklist > > > > certain devices as multi-homed and tainting any isolation group that > > > > contains them. > > > > > > Sorry, I was unclear. These are not multi-homed devices, just > > > multiple-variant devices, any given instance will either be PCI or > > > not. But to handle the two variants easily, the drivers have nested > > > struct devices. It's essentially a pure software artifact we're > > > dealing with here, but nonetheless we need to get the group-ownership > > > of those struct devices correct, whether they're synthetic or not. > > > > So long as the ADD_DEVICE happens with the currently running variant, > > which seems like it has to be the case, I'm not sure how this matters. > > I'm trying to be very careful to only use struct device for groups. > > Nope, I still haven't managed to convey the situation. The *hardware* > comes in two varants, PCI and "dumb bus". The core part of the driver > software binds (only) to a dumb bus struct device (usually a platform > device, actually). > > To support the PCI variant, there is a "wrapper" driver. This is a > PCI driver which does any PCI specific initialization, determines the > register addresses from config space then creates a dumb bus struct > device as a child of the pci_dev. The core driver then attaches to > that dumb bus (e.g. platform) device. > > My point is that in the second case, the dumb bus struct device needs > to be in the same group as its pci_dev parent. > > > > > > > When that happens we can traverse up the bus to find a > > > > > > higher level isolation group. > > > > > > > > > > Well, that's one possible answer to my "where should the hook be > > > > > question": rather than an initialization hook, when we look up a > > > > > device's isolation group, if it doesn't say one explicitly, we try > > > > > it's bridge parent and so forth up the tree. I wonder about the > > > > > overhead of having to walk all the way up the bus heirarchy before > > > > > returning NULL whenever we ask about the group of a device that > > > > > doesn't have one. > > > > > > > > Yep, that could definitely be a concern for streaming dma. > > > > > > Right, which is why I'm suggesting handling at init time instead. > > > We'd need something that runs in the generic hot-add path, after bus > > > notifiers, which would set the device's group to the same as its > > > parent's if a provider hasn't already given it one. > > > > I don't think that works. Back to the FooBus example, if the isolation > > group becomes a fundamental part of the dma_ops path, we may end up with > > groups at each FooDev (or maybe one for FooBus) generated by the > > intermediary that sets up dma using whackyPCICard. > > That's why this would only do something *iff* nothing else has set a > group (like that intermediary). > > > > > > > It would probably make sense to have some > > > > > > kind of top-most isolation group that contains everything as it's an > > > > > > easy proof that if you own everything, you're isolated. > > > > > > > > > > Hrm, no. Things that have no IOMMU above them will have ungated > > > > > access to system RAM, and so can never be safely isolated for > > > > > userspace purposes, even if userspace owned every _device_ in the > > > > > system (not that you could do that in practice, anyway). > > > > > > > > RAM is a good point, there are "non-devices" to worry about. > > > > Potentially that top level group doesn't allow managers. > > > > > > I don't really see the point of a "top-level" group. Groups are > > > exclusive, not heirarchical, and I don't see that there are any paths > > > really simplified by having a (not manageable) "leftovers" group. > > > > Yeah, a top level group may not make sense, but if we plan for dma_ops, > > we need non-manageable groups, which I think is just the degenerate case > > of isolation quality. > > Yeah, I'll buy that. Pending more details on dwmw2's plans, anyway. > > > > > > > Potentially > > > > > > though, wackyPCIcard can also register isolation groups for each of it's > > > > > > FooBus devices if it's able to provide that capability. Thanks, > > > > > > > > > > It could, but why should it. It doesn't know anything about IOMMUs or > > > > > isolation, and it doesn't need to. Even less so for PCI devices which > > > > > create subordinate non-PCI struct devices for internal reasons. > > > > > > > > Sorry I wasn't clear. If wackyPCIcard happens to include an onboard > > > > IOMMU of some sort that's capable of isolation, it might be able to > > > > register groups for each of the devices below it. Therefore we could > > > > end up with a scenario like above that the segment may not have ACS and > > > > therefore not be able to isolate wackyPCIcard from otherdevice, but > > > > wackyPCIcard can isolate FooDev1 from FooDev2 and otherdevice. I think > > > > that means we potentially need to check all devices downstream of an > > > > isolation group to allow a manager to lock it, as well as checking the > > > > path upstream to make sure it isn't used above... Are you sure we're > > > > ready for this kind of infrastructure? Thanks, > > > > > > Hrm, at this stage I think I'm prepared to assume that IOMMUs are > > > either strictly independent (e.g. covering different PCI domains) or > > > are at least vaguely aware of each other (i.e. platform conventions > > > cover the combination). Truly nested IOMMUs, that are entirely > > > unaware of each other is a problem for another day (by which I mean > > > probably never). > > > > Ok, but even if there's no iommu onboard whackyPCICard, it can still be > > possible to create groups on FooBus just to facilitate dma_ops and > > handle quirks. > > No, you can't arbitrarily go creating groups - the group boundary > means something specific. If dma_ops or something needs more info, it > will need that *in addition* to the group. > > > > That said, again, groups have to be exclusive, not heirarchical, so > > > in the above case, I think we'd have a few possible ways to go: > > > > Ownership absolutely has to be exclusive, but it looks like groups > > themselves are hierarchical. > > Doesn't really make sense, since a group is defined as the minimum > isolatable parcel. You can do something if as you walk down the tree > isolation level decreases, but it's certainly not feeling right. > > > > 1) The FooBus isolation provider could refuse to initialize if > > > the FooBus bridge is sharing an iso group with something else before > > > it probes the FooBus. > > > > Sharing, as in a parent group is managed? > > No, just as in a group exists which contains both the FooBus bridge > and something else. Basically this option means that if you plugged > the FooBus bridge into a slot that wasn't sufficiently isolated, it > would simply refuse to work. > > > Yes, something special needs > > to happen in that case, I'd probably investigate having it initialized > > in a managed state versus not initialized at all. > > > > > 2) The FooBus isolation provider could be in the FooBus bridge > > > driver - meaning that if someone tried to grab the PCI group including > > > the FooBridge, the provider driver would be booted out, causing the > > > FooBus groups to cease to exist (meaning the PCI group manager would > > > never get lock while someone was already using the FooGroups). In > > > > Right, I don't know that they need to cease to exist, but if a manager > > is attached anywhere above or below the group it wants to lock, it has > > to be blocked. > > > > > this case, it gets a bit complex. When the FooBus isolation provider > > > is active, the FooBus devices would be in their own groups, not the > > > group of the FooBridge and its sibling. When the FooBus isolation > > > provider is removed, it would have to configure the FooBus IOMMU to a > > > passthrough mode, and revert the FooBus devices to the parent's > > > group. Hm. Complicated. > > > > Yep. I think we're arriving at the same point. Groups are > > hierarchical, but ownership via a manager cannot be nested. So to > > manage a group, we need to walk the entire tree of devices below each > > device checking that none of the groups are managed and all the devices > > are using the right driver, then walk up from the group to verify no > > group of a parent device is managed. Thanks, > > Blargh. I really, really hope we can come up with a simpler model > than that. Yep, I'm pretty well at the end of this experiment. Honestly, I think isolation groups are the wrong approach. We're trying to wrap too many concepts together and it's completely unmanageable. I cannot see adding the complexity we're talking about here to the core device model for such a niche usage. I think we're better off going back to the iommu_device_group() and building that out into something more complete, starting with group based iommu ops and a dma quirk infrastructure. >From there we can add some basic facilities to toggle driver autoprobe, maybe setup notifications for the group, and hopefully include a way to share iommu mappings between groups. Anything much beyond that we should probably leave for something like the vfio driver. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html