On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> > I have a feeling you'll be getting the same capabilities sooner or
> > later, or you won't be able to make use of S/R IOV VFs.

I'm not sure what you mean. We can do SR-IOV just fine (well, with some
limitations due to constraints with how our MMIO segmenting works, and
indeed some of those are being lifted in our future chipsets, but
overall it works).
Don't those limitations include "all VFs must be assigned to the same guest"?
PCI on x86 has function granularity, and SR-IOV reduces this to VF
granularity, but I thought POWER has partition or group granularity,
which is much coarser?
In -theory-, one could do the grouping dynamically with some kind of
API for us as well. However, the constraints are such that it's not
practical. Filtering on RID is based on the number of bits to match in
the bus number and whether to match the dev and fn. So it's not
arbitrary (but it works fine for SR-IOV).

The MMIO segmentation is a bit special too. There is a single MMIO
region in 32-bit space (the size is configurable, but that's not very
practical, so for now we stick to 1G) which is evenly divided into N
segments (where N is the number of PE# supported by the host bridge,
typically 128 with the current bridges). Each segment goes through a
remapping table to select the actual PE# (so large BARs use consecutive
segments mapped to the same PE#).

For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
regions which act as some kind of "accordions": they are evenly divided
into segments in different PE#, and there are several of them which we
can "move around" and typically use to map VF BARs.
So, SR-IOV VFs *don't* have the group limitation? Sorry, I'm deluged by
technical details with no ppc background to relate them to; I can't say
I'm making any sense of this.
> > VFIO here is basically designed for one and only one thing: expose
> > the entire guest physical address space to the device more/less 1:1.
>
> A single level iommu cannot be exposed to guests.  Well, it can be
> exposed as an iommu that does not provide per-device mapping.

Well, x86 ones can't maybe, but on POWER we can and must, thanks to our
essentially paravirt model :-) Even if it wasn't and we used trapping
of accesses to the table, it would work, because in practice, even with
filtering, what we end up having is a per-device (or rather per-PE#)
table.

> A two level iommu can be emulated and exposed to the guest.  See
> http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

What you mean by 2-level is two passes through two trees (ie 6 or 8
levels, right?).
(16 or 25)
We don't have that and probably never will. But again, because we have a paravirt interface to the iommu, it's less of an issue.
Well, then, I guess we need an additional interface to expose that to the guest.
> > This means:
> >
> >  - It only works with iommu's that provide complete DMA address
> > spaces to devices. Won't work with a single 'segmented' address
> > space like we have on POWER.
> >
> >  - It requires the guest to be pinned. Pass-through -> no more swap
>
> Newer iommus (and devices, unfortunately) (will) support I/O page
> faults and then the requirement can be removed.

No. -Some- newer devices will. Out of these, a bunch will have so many
bugs that they're not usable. Some never will. It's a mess really, and
I wouldn't design my stuff based on those premises just yet. Making it
possible to support it, for sure, having it in mind, but not making it
the foundation on which the whole API is designed.
The API is not designed around pinning. It's a side effect of how the IOMMU works. If your IOMMU only maps pages which are under active DMA, then it would only pin those pages.
But I see what you mean, the API is designed around up-front specification of all guest memory.
> >  - It doesn't work for POWER server anyways because of our need to
> > provide a paravirt iommu interface to the guest since that's how
> > pHyp works today and how existing OSes expect to operate.
>
> Then you need to provide that same interface, and implement it using
> the real iommu.

Yes. Working on it. It's not very practical due to how VFIO interacts
in terms of APIs, but solvable. Eventually, we'll make the iommu
H-calls almost entirely real-mode for performance reasons.
The original kvm device assignment code was (and is) part of kvm itself. We're trying to move to vfio to allow sharing with non-kvm users, but it does reduce flexibility. We can have an internal vfio-kvm interface to update mappings in real time.
> >  - Performance sucks of course, the vfio map ioctl wasn't meant for
> > that and has quite a bit of overhead. However we'll want to do the
> > paravirt call directly in the kernel eventually ...
>
> Does the guest iomap each request?  Why?

Not sure what you mean... the guest calls H-calls for every iommu page
mapping/unmapping, yes. So the performance of these is critical, which
is why we'll eventually do it in kernel. We just haven't yet.
I see. x86 traditionally doesn't do it for every request. We had some proposals to do a pviommu that does map every request, but none reached maturity.
> So, you have interrupt redirection?  That is, MSI-x table values
> encode the vcpu, not pcpu?

Not exactly. The MSI-X address is a real PCI address to an MSI port,
and the value is a real interrupt number in the PIC. However, the MSI
port filters by RID (using the same matching as PE#) to ensure that
only allowed devices can write to it, and the PIC has matching PE#
information to ensure that only allowed devices can trigger the
interrupt.

As for the guest knowing what values to put in there (what port address
and interrupt source numbers to use), this is part of the paravirt
APIs: the paravirt APIs handle the configuration, and the HW ensures
that the guest cannot do anything other than what it's allowed to.
Okay, this is something that x86 doesn't have. Strange that it can filter DMA at a fine granularity but not MSI, which is practically the same thing.
> Does the BAR value contain the segment base address?  Or is that
> added later?

It's a shared address space. With a basic configuration on p7ioc, for
example, we have MMIO going from 3G to 4G (PCI-side addresses). BARs
contain the normal PCI address there, but that 1G is divided into 128
segments of equal size which can separately be assigned to PE#'s. So
BARs are allocated by firmware or the kernel PCI code such that devices
in different PEs don't share segments.
Okay, and config space virtualization ensures that the guest can't
remap?

--
error compiling committee.c: too many arguments to function