Resending because of bad qemu-devel address...

On Thu, 5 Jan 2017 16:46:18 +1100
David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote:

> There was a discussion back in November on the qemu list which spilled
> onto the libvirt list about how to add support for PCIe devices to
> POWER VMs, specifically 'pseries' machine type PAPR guests.
>
> Here's a more concrete proposal for how to handle part of this in
> future from the libvirt side. Strictly speaking what I'm suggesting
> here isn't intrinsically linked to PCIe: it will make adding PCIe
> support sanely easier, as well as having a number of advantages for
> both PCIe and plain-PCI devices on PAPR guests.
>
> Background:
>
>  * Currently the pseries machine type only supports vanilla PCI
>    buses.
>     * This is a qemu limitation, not something inherent - PAPR guests
>       running under PowerVM (the IBM hypervisor) can use passthrough
>       PCIe devices (PowerVM doesn't emulate devices though).
>     * In fact the way PCI access is para-virtualized in PAPR makes the
>       usual distinctions between PCI and PCIe largely disappear
>  * Presentation of PCIe devices to PAPR guests is unusual
>     * Unlike x86 - and other "bare metal" platforms - root ports are
>       not made visible to the guest, i.e. all devices (typically)
>       appear as though they were integrated devices on x86
>     * In terms of topology all devices will appear in a way similar to
>       a vanilla PCI bus, even PCIe devices
>        * However PCIe extended config space is accessible
>     * This means libvirt's usual placement of PCIe devices is not
>       suitable for PAPR guests
>  * PAPR has its own hotplug mechanism
>     * This is used instead of standard PCIe hotplug
>     * This mechanism works for both PCIe and vanilla-PCI devices
>     * This can hotplug/unplug devices even without a root port P2P
>       bridge between it and the root "bus"
>  * Multiple independent host bridges are routine on PAPR
>     * Unlike PC (where all host bridges have multiplexed access to
>       configuration space) PCI host bridges (PHBs) are truly
>       independent for PAPR guests (disjoint MMIO regions in system
>       address space)
>     * PowerVM typically presents a separate PHB to the guest for each
>       host slot passed through
>
> The Proposal:
>
> I suggest that libvirt implement a new default algorithm for placing
> (i.e. assigning addresses to) both PCI and PCIe devices for (only)
> PAPR guests.
>
> The short summary is that by default it should assign each device to a
> separate vPHB, creating vPHBs as necessary.
>
>  * For passthrough sometimes a group of host devices can't be safely
>    isolated from each other - this is known as a (host) Partitionable
>    Endpoint (PE). In this case, if any device in the PE is passed
>    through to a guest, the whole PE must be passed through to the
>    same vPHB in the guest. From the guest POV, each vPHB has exactly
>    one (guest) PE.
>  * To allow for hotplugged devices, libvirt should also add a number
>    of additional, empty vPHBs (the PAPR spec allows for hotplug of
>    PHBs, but this is not yet implemented in qemu). When hotplugging
>    a new device (or PE) libvirt should locate a vPHB which doesn't
>    currently contain anything.
>  * libvirt should only (automatically) add PHBs - never root ports or
>    other PCI to PCI bridges
>
> In order to handle migration, the vPHBs will need to be represented in
> the domain XML, which will also allow the user to override this
> topology if they want.
>
> Advantages:
>
> There are still some details I need to figure out w.r.t. handling PCIe
> devices (on both the qemu and libvirt sides). However the fact that

One such detail may be that PCIe devices should have the
"ibm,pci-config-space-type" property set to 1 in the DT, for the driver
to be able to access the extended config space.

> PAPR guests don't typically see PCIe root ports means that the normal
> libvirt PCIe allocation scheme won't work. This scheme has several
> advantages with or without support for PCIe devices:
>
>  * Better performance for 32-bit devices
>
> With multiple devices on a single vPHB they all must share a (fairly
> small) 32-bit DMA/IOMMU window. With separate PHBs they each have a
> separate window. PAPR guests have an always-on guest visible IOMMU.
>
>  * Better EEH handling for passthrough devices
>
> EEH is an IBM hardware-assisted mechanism for isolating and safely
> resetting devices experiencing hardware faults so they don't bring
> down other devices or the system at large. It's roughly similar to
> PCIe AER in concept, but has a different IBM specific interface, and
> works on both PCI and PCIe devices.
>
> Currently the kernel interfaces for handling EEH events on passthrough
> devices will only work if there is a single (host) iommu group in the
> vfio container. While lifting that restriction would be nice, it's
> quite difficult to do so (it requires keeping state synchronized
> between multiple host groups). That also means that an EEH error on
> one device could stop another device where that isn't required by the
> actual hardware.
>
> The unit of EEH isolation is a PE (Partitionable Endpoint) and
> currently there is only one guest PE per vPHB. Changing this might
> also be possible, but is again quite complex and may result in
> confusing and/or broken distinctions between groups for EEH isolation
> and IOMMU isolation purposes.
>
> Placing separate host groups in separate vPHBs sidesteps these
> problems.
>
>  * Guest NUMA node assignment of devices
>
> PAPR does not (and can't reasonably) use the pxb device. Instead to
> allocate devices to different guest NUMA nodes they should be placed
> on different vPHBs. Placing them on different PHBs by default allows
> NUMA nodes to be assigned to those PHBs in a straightforward manner.
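
For illustration only, here is a minimal Python sketch of the placement
rule proposed above: one vPHB per host PE, a few empty spare vPHBs for
hotplug, and hotplug into the first empty vPHB. This is not libvirt or
qemu code. The names VPHB, place_devices and hotplug, the
(device, pe_id) input format, the default of two spares and the index
numbering are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class VPHB:
    """Hypothetical model of one virtual PHB holding at most one host PE."""
    index: int
    devices: list = field(default_factory=list)

    @property
    def empty(self):
        return not self.devices


def place_devices(devices, spare=2):
    """Give each host PE its own vPHB, then append `spare` empty vPHBs.

    `devices` is a list of (device_name, pe_id) pairs. Devices sharing a
    pe_id form one host Partitionable Endpoint and are kept together.
    """
    by_pe = {}
    for name, pe_id in devices:
        by_pe.setdefault(pe_id, []).append(name)

    # One vPHB per PE, in first-seen order (dicts preserve insertion order).
    vphbs = [VPHB(i, members) for i, members in enumerate(by_pe.values())]

    # Empty vPHBs reserved for later hotplug, since PHB hotplug is assumed
    # to be unavailable.
    first_spare = len(vphbs)
    vphbs += [VPHB(first_spare + i) for i in range(spare)]
    return vphbs


def hotplug(vphbs, pe_devices):
    """Attach a hotplugged device (or whole PE) to the first empty vPHB."""
    for phb in vphbs:
        if phb.empty:
            phb.devices = list(pe_devices)
            return phb
    raise RuntimeError("no empty vPHB available for hotplug")
```

With two PEs and two spares this yields four vPHBs, the last two empty
until something is hotplugged. Only PHBs are ever created, never root
ports or PCI-to-PCI bridges, which matches the third point of the
proposal.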
-- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list