Re: Proposal PCI/PCIe device placement on PAPR guests

David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> · Mon, 9 Jan 2017 10:43:49 +1100

On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:
> On Thu, 5 Jan 2017 16:46:18 +1100
> David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> > There was a discussion back in November on the qemu list which spilled
> > onto the libvirt list about how to add support for PCIe devices to
> > POWER VMs, specifically 'pseries' machine type PAPR guests.
> > 
> > Here's a more concrete proposal for how to handle part of this in
> > future from the libvirt side.  Strictly speaking what I'm suggesting
> > here isn't intrinsically linked to PCIe: it will make adding PCIe
> > support sanely easier, as well as having a number of advantages for
> > both PCIe and plain-PCI devices on PAPR guests.
> > 
> > Background:
> > 
> >  * Currently the pseries machine type only supports vanilla PCI
> >    buses.
> >     * This is a qemu limitation, not something inherent - PAPR guests
> >       running under PowerVM (the IBM hypervisor) can use passthrough
> >       PCIe devices (PowerVM doesn't emulate devices though).
> >     * In fact the way PCI access is para-virtualized in PAPR makes the
> >       usual distinctions between PCI and PCIe largely disappear
> >  * Presentation of PCIe devices to PAPR guests is unusual
> >     * Unlike x86 and other "bare metal" platforms, root ports are
> >       not made visible to the guest. i.e. all devices (typically)
> >       appear as though they were integrated devices on x86
> >     * In terms of topology all devices will appear in a way similar to
> >       a vanilla PCI bus, even PCIe devices
> >        * However PCIe extended config space is accessible
> >     * This means libvirt's usual placement of PCIe devices is not
> >       suitable for PAPR guests
> >  * PAPR has its own hotplug mechanism
> >     * This is used instead of standard PCIe hotplug
> >     * This mechanism works for both PCIe and vanilla-PCI devices
> >     * This can hotplug/unplug devices even without a root port P2P
> >       bridge between them and the root "bus"
> >  * Multiple independent host bridges are routine on PAPR
> >     * Unlike PC (where all host bridges have multiplexed access to
> >       configuration space) PCI host bridges (PHBs) are truly
> >       independent for PAPR guests (disjoint MMIO regions in system
> >       address space)
> >     * PowerVM typically presents a separate PHB to the guest for each
> >       host slot passed through
> > 
> > The Proposal:
> > 
> > I suggest that libvirt implement a new default algorithm for placing
> > (i.e. assigning addresses to) both PCI and PCIe devices for (only)
> > PAPR guests.
> > 
> > The short summary is that by default it should assign each device to a
> > separate vPHB, creating vPHBs as necessary.
> > 
> >   * For passthrough sometimes a group of host devices can't be safely
> >     isolated from each other - this is known as a (host) Partitionable
> >     Endpoint (PE).  In this case, if any device in the PE is passed
> >     through to a guest, the whole PE must be passed through to the
> >     same vPHB in the guest.  From the guest POV, each vPHB has exactly
> >     one (guest) PE.
> >   * To allow for hotplugged devices, libvirt should also add a number
> >     of additional, empty vPHBs (the PAPR spec allows for hotplug of
> >     PHBs, but this is not yet implemented in qemu).  When hotplugging
> >     a new device (or PE) libvirt should locate a vPHB which doesn't
> >     currently contain anything.
> >   * libvirt should only (automatically) add PHBs - never root ports or
> >     other PCI to PCI bridges
> > 
> > In order to handle migration, the vPHBs will need to be represented in
> > the domain XML, which will also allow the user to override this
> > topology if they want.
> > 
> > Advantages:
> > 
> > There are still some details I need to figure out w.r.t. handling PCIe
> > devices (on both the qemu and libvirt sides).  However the fact that
> 
> One such detail may be that PCIe devices should have the
> "ibm,pci-config-space-type" property set to 1 in the DT,
> for the driver to be able to access the extended config
> space.

Right.

> > PAPR guests don't typically see PCIe root ports means that the normal
> > libvirt PCIe allocation scheme won't work.  This scheme has several
> > advantages with or without support for PCIe devices:
> > 
> >  * Better performance for 32-bit devices
> > 
> > With multiple devices on a single vPHB they all must share a (fairly
> > small) 32-bit DMA/IOMMU window.  With separate PHBs they each have a
> > separate window.  PAPR guests have an always-on guest visible IOMMU.
> > 
> >  * Better EEH handling for passthrough devices
> > 
> > EEH is an IBM hardware-assisted mechanism for isolating and safely
> > resetting devices experiencing hardware faults so they don't bring
> > down other devices or the system at large.  It's roughly similar to
> > PCIe AER in concept, but has a different IBM specific interface, and
> > works on both PCI and PCIe devices.
> > 
> > Currently the kernel interfaces for handling EEH events on passthrough
> > devices will only work if there is a single (host) iommu group in the
> > vfio container.  While lifting that restriction would be nice, it's
> > quite difficult to do so (it requires keeping state synchronized
> > between multiple host groups).  That also means that an EEH error on
> > one device could stop another device where that isn't required by the
> > actual hardware.
> > 
> > The unit of EEH isolation is a PE (Partitionable Endpoint) and
> > currently there is only one guest PE per vPHB.  Changing this might
> > also be possible, but is again quite complex and may result in
> > confusing and/or broken distinctions between groups for EEH isolation
> > and IOMMU isolation purposes.
> > 
> > Placing separate host groups in separate vPHBs sidesteps these
> > problems.
> > 
> >  * Guest NUMA node assignment of devices
> > 
> > PAPR does not (and can't reasonably) use the pxb device.  Instead, to
> > allocate devices to different guest NUMA nodes they should be placed
> > on different vPHBs.  Placing them on different PHBs by default allows
> > NUMA nodes to be assigned to those PHBs in a straightforward manner.
> > 
> 

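For illustration only, here is a minimal sketch of the default placement
rule proposed above: one vPHB per device (or per host PE for passthrough),
plus a few empty vPHBs reserved for hotplug.  This is not libvirt code.
The names VPHB, place_devices and hotplug, and the default of two spare
vPHBs, are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class VPHB:
    """Hypothetical model of one virtual PCI host bridge in the guest."""
    index: int
    devices: list = field(default_factory=list)  # all devices of a single PE


def place_devices(pe_groups, spare_phbs=2):
    """Give each PE (a list of device names) its own vPHB.

    A lone emulated device is treated as a group of one.  spare_phbs
    empty vPHBs are appended so that hotplug has somewhere to land; the
    default of 2 is an arbitrary choice for this sketch.
    """
    phbs = []
    for group in pe_groups:
        phbs.append(VPHB(index=len(phbs), devices=list(group)))
    for _ in range(spare_phbs):
        phbs.append(VPHB(index=len(phbs)))
    return phbs


def hotplug(phbs, group):
    """Place a hotplugged device (or whole PE) on the first empty vPHB."""
    for phb in phbs:
        if not phb.devices:
            phb.devices = list(group)
            return phb
    # PHB hotplug is not yet implemented in qemu, so we cannot add a vPHB.
    raise RuntimeError("no empty vPHB available for hotplug")


phbs = place_devices([["net0"], ["gpu0", "gpu0-audio"]])
target = hotplug(phbs, ["disk0"])
```

Here the two functions of the passed-through "gpu0" PE share a vPHB, "net0"
gets its own, and the hotplugged "disk0" lands on the first spare one.
Root ports and PCI-to-PCI bridges never appear, in line with the proposal.
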
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson