Re: kvm PCI assignment & VFIO ramblings

Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> · Tue, 09 Aug 2011 21:48:33 -0500

> Mostly correct, yes.  x86 isn't immune to the group problem, it shows up
> for us any time there's a PCIe-to-PCI bridge in the device hierarchy.
> We lose resolution of devices behind the bridge.  As you state though, I
> think of this as only a constraint on what we're able to do with those
> devices.
> 
> Perhaps part of the differences is that on x86 the constraints don't
> really effect how we expose devices to the guest.  We need to hold
> unused devices in the group hostage and use the same iommu domain for
> any devices assigned, but that's not visible to the guest.  AIUI, POWER
> probably needs to expose the bridge (or at least an emulated bridge) to
> the guest, any devices in the group need to show up behind that bridge,

Yes, pretty much, essentially because a group must have as shared iommu
domain and so due to the way our PV representation works, that means the
iommu DMA window is to be exposed by a bridge that covers all the
devices of that group.

> some kind of pvDMA needs to be associated with that group, there might
> be MMIO segments and IOVA windows, etc.  

The MMIO segments are mostly transparent to the guest, we just tell it
where the BARs are and it leaves them alone, at least that's how it
works under pHyp.

Currently on our qemu/vfio expriments, we do let the guest do the BAR
assignment via the emulated stuff using a hack to work around the guest
expectation that the BARs have been already setup (I can fill you on the
details if you really care but it's not very interesting). It works
because we only ever used that on setups where we had a device == a
group, but it's nasty. But in any case, because they are going to be
always in separate pages, it's not too hard for KVM to remap them
wherewver we want so MMIO is basically a non-issue.

> Effectively you want to
> transplant the entire group into the guest.  Is that right?  Thanks,

Well, at least we want to have a bridge for the group (it could and
probably should be a host bridge, ie, an entire PCI domain, that's a lot
easier than trying to mess around with virtual P2P bridges).

>From there, I don't care if we need to expose explicitly each device of
that group one by one. IE. It would be a nice "optimziation" to have the
ability to just specify the group and have qemu pick them all up but it
doesn't really matter in the grand scheme of things.

Currently, we do expose individual devices, but again, it's hacks and it
won't work on many setups etc... with horrid consequences :-) We need to
sort that before we can even think of merging that code on our side.

Cheers,
Ben.

> Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html