On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> > So, would the PCIe Root Bus in a pseries guest behave
> > differently than the one in a q35 or mach-virt guest?
> 
> Yes. I had a long discussion with BenH and got a somewhat better idea
> about this.

Sorry, but I'm afraid you're going to have to break this down even
further for me :(

> If only a single host PE (== iommu group) is passed through and there
> are no emulated devices, the difference isn't too bad: basically on
> pseries you'll see the subtree that would be below the root complex on
> q35.
> 
> But if you pass through multiple groups, things get weird.

Is the difference between q35 and pseries guests with respect to
PCIe only relevant when it comes to assigned devices, or in general?
I'm asking because you seem to focus entirely on assigned devices.

> On q35,
> you'd generally expect physically separate (different slot) devices to
> appear under separate root complexes.

This part I don't get at all, so please bear with me.

The way I read it, you're claiming that eg. a SCSI controller and a
network adapter, being physically separate and assigned to separate
PCI slots, should each get a dedicated PCIe Root Complex on a q35
guest.

That doesn't match my experience, where you would simply assign them
to separate slots of the default PCIe Root Bus (pcie.0), eg. 00:01.0
and 00:02.0.

Maybe you're referring to the fact that you might want to create
multiple PCIe Root Complexes in order to assign the host devices to
separate guest NUMA nodes?

How is creating multiple PCIe Root Complexes on q35 using pxb-pcie
different from creating multiple PHBs using spapr-pci-host-bridge on
pseries?
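Just to make sure we're comparing the same setups, here's roughly
what I have in mind on the QEMU command line. This is only a sketch:
bus numbers, slots, host addresses and the NUMA layout are all made
up for illustration, so please correct me if the topology is off.

  # q35: an extra Root Complex (pxb-pcie) tied to guest NUMA node 1,
  # with the assigned device sitting on a Root Port behind it
  qemu-system-x86_64 -machine q35 -m 2G \
    -numa node,nodeid=0 -numa node,nodeid=1 \
    -device pxb-pcie,id=pxb1,bus_nr=0x80,numa_node=1,bus=pcie.0 \
    -device ioh3420,id=rp1,bus=pxb1,chassis=1,slot=0 \
    -device vfio-pci,host=0000:01:00.0,bus=rp1,addr=0x0

  # pseries: an extra PHB (spapr-pci-host-bridge), with the same
  # device plugged directly into the pci.1 bus it provides
  qemu-system-ppc64 -machine pseries -m 2G \
    -device spapr-pci-host-bridge,index=1 \
    -device vfio-pci,host=0000:01:00.0,bus=pci.1,addr=0x1

In both cases the guest ends up with an additional host bridge to
place the assigned device behind, which is why the two look pretty
much equivalent to me from libvirt's point of view.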
> Whereas on pseries they'll
> appear as siblings on a virtual bus (which makes no physical sense for
> point-to-point PCI-E).

What is the virtual bus in question? Why would it matter that
they're siblings?

I'm possibly missing the point entirely, but so far it looks to me
like there are different configurations you might want to use
depending on your goal, and both q35 and pseries give you comparable
tools to achieve them.

> I suppose we could try treating all devices on pseries as though they
> were chipset builtin devices on q35, which will appear on the root
> PCI-E bus without root complex. But I suspect that's likely to cause
> trouble with hotplug, and it will certainly need different address
> allocation from libvirt.

PCIe Integrated Endpoint Devices are not hotpluggable on q35, which
is why libvirt follows QEMU's PCIe topology recommendations and
places a PCIe Root Port between them and the Root Bus; I assume the
same could be done for pseries guests as soon as QEMU grows support
for generic PCIe Root Ports, something Marcel has already posted
patches for.

Again, sorry for clearly misunderstanding your explanation, but I'm
still not seeing the issue here. I'm sure it's very clear in your
mind, but I'm afraid you're going to have to walk me through it :(

> > Regardless of how we decide to move forward with the
> > PCIe-enabled pseries machine type, libvirt will have to
> > know about this so it can behave appropriately.
> 
> So there are kind of two extremes of how to address this. There are a
> variety of options in between, but I suspect they're going to be even
> more muddled and hideous than the extremes.
> 
> 1) Give up. You said there's already a flag that says a PCI-E bus is
> able to accept vanilla-PCI devices. We add a hack flag that says a
> vanilla-PCI bus is able to accept PCI-E devices. We keep address
> allocation as it is now - the pseries topology really does resemble
> vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
> devices, and PAPR has mechanisms for accessing the extended config
> space. PCI-E standard hotplug and error reporting will never work,
> but PAPR provides its own mechanisms for those, so that should be ok.

We can definitely special-case pseries guests and take the "anything
goes" approach to PCI vs PCIe, but it would certainly be nicer if we
could avoid presenting our users with the head-scratching situation
of PCIe devices being plugged into legacy PCI slots and still
showing up as PCIe in the guest.

What about virtio devices, which present themselves either as legacy
PCI or PCIe depending on the kind of slot they are plugged into?
Would they show up as PCIe or legacy PCI on a PCIe-enabled pseries
guest?

> 2) Start exposing the PCI-E hierarchy for pseries guests much more
> like q35, root complexes and all. It's not clear that PAPR actually
> *forbids* exposing the root complex, it just doesn't require it and
> that's not what PowerVM does. But.. there are big questions about
> whether existing guests will cope with this or not. When you start
> adding in multiple passed through devices and particularly virtual
> functions as well, things could get very ugly - we might need to
> construct multiple emulated virtual root complexes or other messes.
> 
> In the short to medium term, I'm thinking option (1) seems pretty
> compelling.

Is the Root Complex not currently exposed? The Root Bus certainly
is, otherwise PCI devices wouldn't work at all, I assume. And I can
clearly see a pci.0 bus in the output of 'info qtree' for a pseries
guest, and a pci.1 too if I add a spapr-pci-host-bridge.

Maybe I just don't quite get the relationship between Root Complexes
and Root Buses, but I guess my question is: what is preventing us
from simply doing whatever a spapr-pci-host-bridge is doing in order
to expose a legacy PCI Root Bus (pci.*) to the guest, and creating a
new spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
instead?
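To put it another way, I'm thinking of something along these lines.
Note that spapr-pcie-host-bridge doesn't exist today, it's just a
name I made up to illustrate the idea, and the rest of the command
line is only a sketch as well.

  # what works today: an extra legacy PCI PHB, which shows up as
  # pci.1 in 'info qtree'
  qemu-system-ppc64 -machine pseries \
    -device spapr-pci-host-bridge,index=1

  # what I'm asking about: a hypothetical PCIe flavour of the same
  # device, which would expose a pcie.1 Root Bus instead
  qemu-system-ppc64 -machine pseries \
    -device spapr-pcie-host-bridge,index=1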
> So, I'm not sure if the idea of a new machine type has legs or not,
> but let's think it through a bit further. Suppose we have a new
> machine type, let's call it 'papr'. I'm thinking it would be (at
> least with -nodefaults) basically a super-minimal version of pseries:
> so each PHB would have to be explicitly created, the VIO bridge would
> have to be explicitly created, likewise the NVRAM. Not sure about the
> "devices" which really represent firmware features - the RTC, RNG,
> hypervisor event source and so forth.
> 
> Might have some advantages. Then again, it doesn't really solve the
> specific problem here. It means libvirt (or the user) has to
> explicitly choose a PCI or PCI-E PHB to put things on,

libvirt would probably add a <controller type='pci' model='pcie-root'/>
to the guest XML by default, resulting in a spapr-pcie-host-bridge
providing pcie.0 and the same controller / address allocation logic
as q35; the user would be able to use
<controller type='pci' model='pci-root'/> instead to stick with
legacy PCI.

This would only matter when using '-nodefaults' anyway: when that
flag is not present, a PCIe (or legacy PCI) PHB could be created by
QEMU automatically, to make things more convenient for people who
are not using libvirt.

Maybe we should have a different model, specific to pseries guests,
instead, so that all PHBs would look the same in the guest XML,
something like

  <controller type='pci' model='phb-pcie'/>

It would require shuffling libvirt's PCI address allocation code
around quite a bit, but it should be doable. And if it makes life
easier for our users, then it's worth it.

> but libvirt's
> PCI-E address allocation will still be wrong in all probability.
> 
> Guh.
> 
> As an aside, here's a RANT.

[...]

Laine already addressed your points extensively, but I'd like to add
a few thoughts of my own.

* PCI addresses for libvirt guests don't need to be stable only when
  performing migration, but also to guarantee that no change in
  guest ABI will happen as a consequence of eg. a simple power
  cycle.

* Even if libvirt left all PCI address assignment to QEMU, we would
  need a way for users to override QEMU's choices, because one size
  never fits all and users have all kinds of crazy, yet valid,
  requirements. So the first time we run QEMU, we would have to take
  the backend-specific format you suggest, parse it to extract the
  PCI addresses that have been assigned, and reflect them in the
  guest XML so that the user can change a bunch of them. Then I
  guess we could re-encode it in the backend-specific format and
  pass it to QEMU the next time we run it but, at that point, how is
  that different from simply putting the PCI addresses on the
  command line directly?

* It's not just about the addresses, by the way, but also about the
  controllers - what model is used, how they are plugged together
  and so on. More stuff that would have to round-trip because users
  need to be able to take matters into their own hands.

* Design mistakes in any software, combined with strict backwards
  compatibility requirements, make it difficult to change both
  related components and the software itself, even when the changes
  would be very beneficial. It can be very frustrating at times, but
  it's the reality of things and unfortunately there's only so much
  we can do about it.

* Eduardo's work, which you mentioned, is going to be very
  beneficial in the long run; in the short run, Marcel's PCIe device
  placement guidelines, a document that has seen contributions from
  QEMU, OVMF and libvirt developers, have been invaluable in
  improving libvirt's PCI address allocation logic.

So we're already doing better, and more improvements are on the
way :)

-- 
Andrea Bolognani / Red Hat / Virtualization

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list