Re: kvm PCI assignment & VFIO ramblings

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 02 Aug 2011 09:34:58 -0600

On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> 
> Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> VFs will generally not have limitations like that no, but on the other
> hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> take a bunch of VFs and put them in the same 'domain'.
> 
> I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> tries to put all devices for a given guest into a "domain".

Actually, that's only a recent optimization; before that, each device
got its own iommu domain.  Which devices get their own iommu domain and
which share one is completely configurable on the qemu command line.
The default minimizes the number of domains (to one) and thus the number
of mapping callbacks, since we pin the entire guest.
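
To put rough numbers on that, here is a toy sketch (not VFIO or qemu
code).  The function name and the idea of counting one map call per
guest memory region per domain are assumptions made purely for
illustration; the real cost depends on how guest memory is laid out.

```python
# Toy model only: it counts map operations and touches no real IOMMU API.
# mapping_callbacks() and its parameters are hypothetical names.

def mapping_callbacks(num_devices, num_guest_regions, share_domain):
    """Return (domains, map_calls) needed to pin the whole guest."""
    # One shared domain for all devices, or one domain per device.
    domains = 1 if share_domain else num_devices
    # Every domain has to map every guest memory region once.
    map_calls = domains * num_guest_regions
    return domains, map_calls


shared = mapping_callbacks(4, 8, share_domain=True)
separate = mapping_callbacks(4, 8, share_domain=False)
print("shared domain:    ", shared)
print("domain per device:", separate)
```

With four devices and eight guest regions, the shared case needs one
domain and eight map calls, while a domain per device needs four domains
and thirty-two map calls for the same guest memory.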

> On POWER, we have a different view of things where domains/groups are
> defined to be the smallest granularity we can (down to a single VF) and
> we give several groups to a guest (ie we avoid sharing the iommu in most
> cases)
> 
> This is driven by the HW design but that design is itself driven by the
> idea that the domains/group are also error isolation groups and we don't
> want to take all of the IOs of a guest down if one adapter in that guest
> is having an error.
> 
> The x86 domains are conceptually different as they are about sharing the
> iommu page tables with the clear long term intent of then sharing those
> page tables with the guest CPU's own. We aren't going in that direction
> (at this point at least) on POWER.

Yes and no.  The x86 domains are pretty flexible and are used in a few
different ways.  On the host we do dynamic DMA with a domain per device,
mapping only the in-flight DMA ranges.  To achieve the transparent
device assignment model, we have to flip that around and map the entire
guest.  As noted, we can continue to use separate domains for this, but
since each one maps the entire guest, that doesn't add much value; it
uses more resources and requires more mapping callbacks (and x86 doesn't
have the best error containment anyway).  If we had a well-supported
IOMMU model that we could adapt for pvDMA, then it would make sense to
keep each device in its own domain again.  Thanks,

Alex
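
P.S. The two x86 usage models described above (dynamic per-device
mapping on the host versus mapping the entire guest for assignment) can
be contrasted with a small toy model.  This is only a sketch: ToyDomain,
host_dynamic_dma() and guest_static_pin() are hypothetical names, pages
are plain integers, and nothing here corresponds to a real kernel
interface.

```python
class ToyDomain:
    """Hypothetical stand-in for an IOMMU domain: a set of mapped pages."""

    def __init__(self):
        self.mapped = set()
        self.map_calls = 0

    def map(self, start_page, num_pages):
        self.mapped.update(range(start_page, start_page + num_pages))
        self.map_calls += 1

    def unmap(self, start_page, num_pages):
        self.mapped.difference_update(range(start_page, start_page + num_pages))


def host_dynamic_dma(domain, transfers):
    """Host model: map each DMA range only while it is in flight."""
    peak = 0
    for start, count in transfers:
        domain.map(start, count)
        peak = max(peak, len(domain.mapped))
        domain.unmap(start, count)  # transfer done, tear the mapping down
    return peak


def guest_static_pin(domain, guest_pages):
    """Assignment model: map all of guest memory up front, once."""
    domain.map(0, guest_pages)
    return len(domain.mapped)


transfers = [(10, 2), (40, 4), (10, 2)]  # (start_page, num_pages), made up
host = ToyDomain()
guest = ToyDomain()
peak_host = host_dynamic_dma(host, transfers)
pinned = guest_static_pin(guest, 1024)
print("host: peak pages mapped", peak_host, "map calls", host.map_calls)
print("guest: pages pinned", pinned, "map calls", guest.map_calls)
```

The host domain only ever exposes the pages of the current transfer but
pays a map and unmap per transfer; the guest domain exposes everything
but is set up once, which is why extra per-device copies of it buy
little.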
