On Tue, Dec 06, 2016 at 06:30:47PM +0100, Andrea Bolognani wrote:
> On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> > > So, would the PCIe Root Bus in a pseries guest behave
> > > differently than the one in a q35 or mach-virt guest?
> >
> > Yes.  I had a long discussion with BenH and got a somewhat better
> > idea about this.
>
> Sorry, but I'm afraid you're going to have to break this
> down even further for me :(
>
> > If only a single host PE (== iommu group) is passed through and
> > there are no emulated devices, the difference isn't too bad:
> > basically on pseries you'll see the subtree that would be below the
> > root complex on q35.
> >
> > But if you pass through multiple groups, things get weird.
>
> Is the difference between q35 and pseries guests with
> respect to PCIe only relevant when it comes to assigned
> devices, or in general?  I'm asking this because you seem to
> focus entirely on assigned devices.

Well, in a sense that's up to us.  The only existing model we have is
PowerVM, and PowerVM only does device passthrough, no emulated
devices.  PAPR doesn't really distinguish one way or the other, but
it's written from the perspective of assuming that all PCI devices
correspond to physical devices on the host.

> > On q35, you'd generally expect physically separate (different slot)
> > devices to appear under separate root complexes.
>
> This part I don't get at all, so please bear with me.
>
> The way I read it you're claiming that eg. a SCSI controller
> and a network adapter, being physically separate and assigned
> to separate PCI slots, should have a dedicated PCIe Root
> Complex each on a q35 guest.

Right, my understanding was that if the devices were slotted, rather
than integrated, each one would sit under a separate root complex, the
root complex being a pseudo PCI to PCI bridge.

> That doesn't match with my experience, where you would simply
> assign them to separate slots of the default PCIe Root Bus
> (pcie.0), eg. 00:01.0 and 00:02.0.

The qemu default, or the libvirt default?  I think this represents
treating the devices as though they were integrated devices in the
host bridge.  I believe on q35 they would not be hotpluggable - but on
pseries they would be (because we don't use the standard hot plug
controller).

> Maybe you're referring to the fact that you might want to
> create multiple PCIe Root Complexes in order to assign the
> host devices to separate guest NUMA nodes?  How is creating
> multiple PCIe Root Complexes on q35 using pxb-pcie different
> than creating multiple PHBs using spapr-pci-host-bridge on
> pseries?

Uh.. AIUI the root complex is the PCI to PCI bridge under which PCI-E
slots appear.  PXB is something different - essentially different host
bridges as you say (though with some weird hacks to access config
space, which make it dependent on the primary bus in a way which spapr
PHBs are not).

I'll admit I'm pretty confused myself about the exact distinction
between root complex, root port and upstream and downstream ports.

> > Whereas on pseries they'll appear as siblings on a virtual bus
> > (which makes no physical sense for point-to-point PCI-E).
>
> What is the virtual bus in question?  Why would it matter
> that they're siblings?

On pseries it won't.  But my understanding is that libvirt won't
create them that way on q35 - instead it will insert the RCs / P2P
bridges to allow them to be hotplugged.  Inserting that bridge may
confuse pseries guests which aren't expecting it.

> I'm possibly missing the point entirely, but so far it
> looks to me like there are different configurations you
> might want to use depending on your goal, and both q35
> and pseries give you comparable tools to achieve such
> configurations.

> > I suppose we could try treating all devices on pseries as though
> > they were chipset builtin devices on q35, which will appear on the
> > root PCI-E bus without root complex.
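(To make that distinction concrete, here's an untested sketch of the
two q35 placement styles.  I'm using ioh3420, the Intel root port
device, since the generic root port Marcel posted isn't merged yet;
the device choice and addresses are just for illustration.)

```shell
# A device placed directly on pcie.0 acts like a chipset-integrated
# endpoint (no standard hotplug), while one behind a root port is a
# normal slotted, hotpluggable PCI-E device.
qemu-system-x86_64 -machine q35 -nodefaults -display none \
    -device e1000e,bus=pcie.0,addr=0x2 \
    -device ioh3420,id=rp1,bus=pcie.0,addr=0x3 \
    -device e1000e,bus=rp1
```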
> > But I suspect that's likely to cause trouble with hotplug, and it
> > will certainly need different address allocation from libvirt.
>
> PCIe Integrated Endpoint Devices are not hotpluggable on
> q35, that's why libvirt will follow QEMU's PCIe topology
> recommendations and place a PCIe Root Port between them;
> I assume the same could be done for pseries guests as
> soon as QEMU grows support for generic PCIe Root Ports,
> something Marcel has already posted patches for.

Here you've hit on it.  No, we should not do that for pseries, AFAICT.
PAPR doesn't really have the concept of integrated endpoint devices,
and all devices can be hotplugged via the PAPR mechanisms (and none
can via the PCI-E standard hotplug mechanism).

> Again, sorry for clearly misunderstanding your explanation,
> but I'm still not seeing the issue here.  I'm sure it's very
> clear in your mind, but I'm afraid you're going to have to
> walk me through it :(

I wish it were entirely clear in my mind.  Like I say, I'm still
pretty confused by exactly what the root complex entails.

> > > Regardless of how we decide to move forward with the
> > > PCIe-enabled pseries machine type, libvirt will have to
> > > know about this so it can behave appropriately.
> >
> > So there are kind of two extremes of how to address this.  There
> > are a variety of options in between, but I suspect they're going to
> > be even more muddled and hideous than the extremes.
> >
> > 1) Give up.  You said there's already a flag that says a PCI-E bus
> > is able to accept vanilla-PCI devices.  We add a hack flag that
> > says a vanilla-PCI bus is able to accept PCI-E devices.  We keep
> > address allocation as it is now - the pseries topology really does
> > resemble vanilla-PCI much more closely than it does PCI-E.  But, we
> > allow PCI-E devices, and PAPR has mechanisms for accessing the
> > extended config space.
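(As an aside, to make the "resembles vanilla-PCI" point concrete: this
is roughly how extra PHBs work on pseries today - untested sketch.
Each spapr-pci-host-bridge gives the guest another flat bus, pci.1,
pci.2, ..., with devices sitting on it as siblings; the device and
address here are arbitrary examples.)

```shell
# Second PHB on a pseries guest; devices go straight on the flat
# pci.1 bus, with no root ports or bridges in between.
qemu-system-ppc64 -machine pseries \
    -device spapr-pci-host-bridge,index=1 \
    -device virtio-net-pci,bus=pci.1,addr=0x1
```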
> > PCI-E standard hotplug and error reporting will never work, but
> > PAPR provides its own mechanisms for those, so that should be ok.
>
> We can definitely special-case pseries guests and take
> the "anything goes" approach to PCI vs PCIe, but it would
> certainly be nicer if we could avoid presenting our users
> the head-scratching situation of PCIe devices being plugged
> into legacy PCI slots and still showing up as PCIe in the
> guest.
>
> What about virtio devices, which present themselves either
> as legacy PCI or PCIe depending on the kind of slot they
> are plugged into?  Would they show up as PCIe or legacy PCI
> on a PCIe-enabled pseries guest?

That we'd have to address on the qemu side with some

> > 2) Start exposing the PCI-E hierarchy for pseries guests much more
> > like q35, root complexes and all.  It's not clear that PAPR
> > actually *forbids* exposing the root complex, it just doesn't
> > require it and that's not what PowerVM does.  But.. there are big
> > questions about whether existing guests will cope with this or not.
> > When you start adding in multiple passed through devices and
> > particularly virtual functions as well, things could get very ugly
> > - we might need to construct multiple emulated virtual root
> > complexes or other messes.
> >
> > In the short to medium term, I'm thinking option (1) seems pretty
> > compelling.
>
> Is the Root Complex not currently exposed?  The Root Bus
> certainly is,

Like I say, I'm fairly confused myself, but I'm pretty sure that Root
Complex != Root Bus.  The RC sits under the root bus IIRC.. or
possibly it consists of the root bus plus something under it as well.

Now... from what Laine was saying, it sounds like more of the
differences between PCI-E placement and PCI placement are implemented
by libvirt, rather than qemu, than I realized.  So possibly we do want
to make the bus be PCI-E on the qemu side, but have libvirt use the
vanilla-PCI placement guidelines rather than PCI-E for pseries.

> otherwise PCI devices won't work at all, I
> assume.  And I can clearly see a pci.0 bus in the output
> of 'info qtree' for a pseries guest, and a pci.1 too if
> I add a spapr-pci-host-bridge.
>
> Maybe I just don't quite get the relationship between Root
> Complexes and Root Buses, but I guess my question is: what
> is preventing us from simply doing whatever a
> spapr-pci-host-bridge is doing in order to expose a legacy
> PCI Root Bus (pci.*) to the guest, and create a new
> spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
> instead?

Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
bridge came up before.  I think one of us spotted a problem with that,
but I don't recall what it was now.  I guess one is how libvirt would
map its stupid-fake-domain-numbers to which root bus to use.

> > So, I'm not sure if the idea of a new machine type has legs or not,
> > but let's think it through a bit further.  Suppose we have a new
> > machine type, let's call it 'papr'.  I'm thinking it would be (at
> > least with -nodefaults) basically a super-minimal version of
> > pseries: so each PHB would have to be explicitly created, the VIO
> > bridge would have to be explicitly created, likewise the NVRAM.
> > Not sure about the "devices" which really represent firmware
> > features - the RTC, RNG, hypervisor event source and so forth.
> >
> > Might have some advantages.  Then again, it doesn't really solve
> > the specific problem here.  It means libvirt (or the user) has to
> > explicitly choose a PCI or PCI-E PHB to put things on,
>
> libvirt would probably add a
>
>   <controller type='pci' model='pcie-root'/>
>
> to the guest XML by default, resulting in a
> spapr-pcie-host-bridge providing pcie.0 and the same
> controller / address allocation logic as q35; the user
> would be able to use
>
>   <controller type='pci' model='pci-root'/>
>
> instead to stick with legacy PCI.
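(So, sketching out what I understand you're proposing - this is
hypothetical XML, assuming a spapr-pcie-host-bridge device existed;
these are your suggested spellings, not anything libvirt accepts for
pseries today:)

```xml
<!-- default: PCIe PHB, giving pcie.0 and q35-style address
     allocation; would map to a (hypothetical) spapr-pcie-host-bridge -->
<controller type='pci' index='0' model='pcie-root'/>

<!-- opt-in alternative: legacy PCI PHB, mapping to today's
     spapr-pci-host-bridge -->
<!-- <controller type='pci' index='0' model='pci-root'/> -->
```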
> This would only matter when using '-nodefaults' anyway;
> when that flag is not present a PCIe (or legacy PCI) root
> bus could be created by QEMU to make it more convenient
> for people that are not using libvirt.
>
> Maybe we should have a different model, specific to
> pseries guests, instead, so that all PHBs would look the
> same in the guest XML, something like
>
>   <controller type='pci' model='phb-pcie'/>
>
> It would require shuffling libvirt's PCI address allocation
> code around quite a bit, but it should be doable.  And if it
> makes life easier for our users, then it's worth it.

Hrm.  So my first inclination would be to stick with the generic
names, and map those to creating new pseries host bridges on pseries
guests.  I would have thought that would be the easier option for
users.  But I may not have realized all the implications yet.

> > but libvirt's PCI-E address allocation will still be wrong in all
> > probability.
> >
> > Guh.
>
> > As an aside, here's a RANT.
> [...]
>
> Laine already addressed your points extensively, but I'd
> like to add a few thoughts of my own.
>
> * PCI addresses for libvirt guests don't need to be stable
>   only when performing migration, but also to guarantee
>   that no change in guest ABI will happen as a consequence
>   of eg. a simple power cycle.
>
> * Even if libvirt left all PCI address assignment to QEMU,
>   we would need a way for users to override QEMU's choices,
>   because one size never fits all and users have all kinds
>   of crazy, yet valid, requirements.  So the first time we
>   run QEMU, we would have to take the backend-specific
>   format you suggest, parse it to extract the PCI addresses
>   that have been assigned, and reflect them in the guest
>   XML so that the user can change a bunch of them.  Then I
>   guess we could re-encode it in the backend-specific format
>   and pass it to QEMU the next time we run it but, at that
>   point, what's the difference with simply putting the PCI
>   addresses on the command line directly?
>
> * It's not just about the addresses, by the way, but also
>   about the controllers - what model is used, how they are
>   plugged together and so on.  More stuff that would have to
>   round-trip because users need to be able to take matters
>   into their own hands.
>
> * Design mistakes in any software, combined with strict
>   backwards compatibility requirements, make it difficult
>   to make changes in both related components and the
>   software itself, even when the changes would be very
>   beneficial.  It can be very frustrating at times, but
>   it's the reality of things and unfortunately there's only
>   so much we can do about it.

I think the above I've touched on in my reply to Laine.

> * Eduardo's work, which you mentioned, is going to be very
>   beneficial in the long run; in the short run, Marcel's
>   PCIe device placement guidelines, a document that has seen
>   contributions from QEMU, OVMF and libvirt developers, have
>   been invaluable to improve libvirt's PCI address allocation
>   logic.  So we're already doing better, and more improvements
>   are on the way :)

Right.. so here's the thing: I strongly suspect that Marcel's
guidelines will not be correct for pseries.  I'm not sure if they'll
be definitively wrong, or just different enough from PowerVM that it
might confuse guests, but either way.  Can you send me a link to that
document, though?  It might help me figure this out.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list