Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system

On 13.12.19 12:52, Nicholas Johnson wrote:
On Fri, Dec 13, 2019 at 11:58:53AM +0100, Stefan Roese wrote:
Hi Nicholas,

On 13.12.19 10:00, Nicholas Johnson wrote:
On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
Hi!
Hi,

I am facing an issue with PCIe hotplug on an AMD Epyc based system.
Our system is equipped with an HBA for NVMe SSDs including a PCIe
switch (Supermicro AOC-SLG3-4E2P) [1], and we would like to be able
to hotplug NVMe disks.

Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here
are a few tests and the results I have gotten so far. All tests were
done with one Intel NVMe SSD connected to one of the 4 NVMe ports of
the HBA and the other 3 ports (currently) left unconnected:

a) Kernel Parameter "pci=pcie_bus_safe"
The resources of the 3 unused PCIe slots of the PEX switch are not
assigned in this test.

b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
In this test I restricted the resources of the hotplug slots to the
minimum (see the sketch after test d) for how these parameters are
parsed). This still results in unassigned resources for the unused
PCIe slots of the PEX switch.

c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
Again, not all resources are assigned.

d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
Now all requested resources are available for the hotplug PCIe slots
of the PEX switch, but the NVMe driver fails while probing. Debugging
has shown that reading from the BAR of the NVMe disk returns
0xffffffff. Reading from the PLX PEX switch registers also returns
0xffffffff in this case (this works, of course, without nocrs, when
the BARs are mapped at a different address).
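For reference, these hp*size parameters map onto the hotplug
window-size globals in drivers/pci/pci.c. Below is a rough,
from-memory sketch of the relevant branch of the pci_setup() option
parser around v5.5 (not verbatim kernel code): hpmemsize sets both
windows, while the newer hpmmiosize / hpmmioprefsize parameters set
them individually.

    } else if (!strncmp(str, "hpiosize=", 9)) {
            pci_hotplug_io_size = memparse(str + 9, &str);
    } else if (!strncmp(str, "hpmmiosize=", 11)) {
            pci_hotplug_mmio_size = memparse(str + 11, &str);
    } else if (!strncmp(str, "hpmmioprefsize=", 15)) {
            pci_hotplug_mmio_pref_size = memparse(str + 15, &str);
    } else if (!strncmp(str, "hpmemsize=", 10)) {
            /* hpmemsize sets both the MMIO and MMIO_PREF sizes */
            pci_hotplug_mmio_size = memparse(str + 10, &str);
            pci_hotplug_mmio_pref_size = pci_hotplug_mmio_size;
    }

These sizes are later consumed when sizing bridge windows (the
pbus_size_io()/pbus_size_mem() paths in drivers/pci/setup-bus.c) as
the additional space reserved for hotplug bridges.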

Does anybody have a clue as to why the access to the PEX switch and/or
the NVMe BAR does not work in the "nocrs" case? The BARs are located in
the same window that is provided by the BIOS in the ACPI list (but is
"ignored" in this case) [3].

Or is it possible to get the hotplug resource mapping done correctly
without setting "nocrs" for our setup with the PCIe/NVMe switch?

I can provide all sorts of logs (dmesg, lspci etc.) if needed - just
let me know.

Many thanks in advance,
Stefan
This will be a quick response for now. I will get more in depth tonight
when I have more time.

What I have taken away from this is:

1. Epyc -> up to 4x PCIe root complexes, but from what I can gather,
they are probably assigned to the same segment / domain with
non-overlapping bus numbers. Either way, multiple RCs may complicate
using pci=nocrs and other options. Unfortunately, I have not had the
privilege of owning a system with multiple RCs, so I cannot be sure.

2. Not using Thunderbolt - the [2] patch series only really makes a
difference with nested hotplug bridges, such as in Thunderbolt. It
might help by not using additional resource lists, but I still do not
think it will matter without nested hotplug bridges.

I was not sure about those patches, but since they have been queued
for 5.6, I included them in these tests. The results without them are
similar (or even identical; I would need to re-run the tests to be
sure).
3. System not reallocating resources despite the override -> is the
ACPI _DSM method evaluating to zero?

Not sure I follow you here. The kernel is reallocating the resources,
or at least trying to, when requested via bootargs (tests c) and d)).
I have attached the logs from all 4 tests in an archive [1]. It just
fails to reallocate the resources in test case c), and even though it
successfully reallocates the resources in test case d), the new
addresses at the PEX switch and its ports "don't work".
It is unlikely to be the issue, but I thought it was worth a mention.


I experienced this recently with an Intel Ice Lake system. I booted
the laptop at the retail store into Linux off a USB drive to find out
about the Thunderbolt implementation. I dumped "sudo lspci -xxxx" and
dmesg output and analysed the results at home.

Very brave. ;)
It's a retail store with display models for people to play with. If I
do not damage it (or I pay for any damage caused), then I have nothing
to be afraid of.

Sure. I was referring to you being "brave" in doing all this analysis /
debugging without having the system at hand for any further tests. ;)

I noticed it did not override the resources, and from examining the
source code, it likely evaluated _DSM to 0, which may have overridden
pci=realloc. Try modifying the source code to unconditionally apply
realloc in drivers/pci/setup-bus.c and see what happens. I have not
bothered to do this myself and go back to the store to test this
hypothesis.
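For context, the check in question (merged around v5.4, in
acpi_pci_root_create() in drivers/acpi/pci_root.c) looks roughly like
the sketch below. This is from memory, not a verbatim quote of the
kernel source, and evaluate_preserve_config() is just a wrapper name
made up for the sketch:

    #include <linux/acpi.h>
    #include <linux/pci.h>
    #include <linux/pci-acpi.h>

    /*
     * Sketch of the "PCI Boot Configuration" _DSM evaluation
     * (function 5, IGNORE_PCI_BOOT_CONFIG_DSM). If the method
     * exists and evaluates to 0, the kernel sets
     * host_bridge->preserve_config and keeps the firmware
     * resource assignments for that host bridge.
     */
    static void evaluate_preserve_config(struct pci_host_bridge *host_bridge,
                                         struct pci_bus *bus)
    {
            union acpi_object *obj;

            obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge),
                                    &pci_acpi_dsm_guid, 1,
                                    IGNORE_PCI_BOOT_CONFIG_DSM, NULL);
            if (obj && obj->type == ACPI_TYPE_INTEGER &&
                obj->integer.value == 0)
                    host_bridge->preserve_config = 1;
            ACPI_FREE(obj);
    }

If that flag ends up set, the kernel claims the firmware-assigned
resources instead of reassigning them, which (as far as I understand)
could make a realloc request appear to be ignored.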

realloc is enabled via boot args and active in the kernel, as you can
see from the dmesg log [2].
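For reference, the "realloc" handling in drivers/pci/setup-bus.c is
roughly the following (again a from-memory sketch, not verbatim kernel
code). A bare "pci=realloc" ends up as user_enabled:

    enum enable_type {
            undefined = -1,
            user_disabled,
            auto_disabled,
            user_enabled,
            auto_enabled,
    };

    static enum enable_type pci_realloc_enable = undefined;

    /* Called from the pci= option parser for "realloc[=on|off]". */
    void __init pci_realloc_get_opt(char *str)
    {
            if (!strncmp(str, "off", 3))
                    pci_realloc_enable = user_disabled;
            else if (!strncmp(str, "on", 2))
                    pci_realloc_enable = user_enabled;
    }

    static bool pci_realloc_enabled(enum enable_type enable)
    {
            return enable >= user_enabled;
    }

pci_assign_unassigned_root_bus_resources() then uses this to decide
whether to make multiple allocation passes, releasing bridge resources
in between. Making pci_realloc_enabled() return true unconditionally
would be one way to test the hypothesis above.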
4. It would be helpful if you attached the full dmesg and "sudo lspci
-xxxx" output, which dumps the full PCI config space, allowing us to
run any lspci query from the file as if we were on your system. I will
be able to tell a lot more after seeing that. Possibly do one run with
no kernel parameters, and another set of results with all of the
kernel parameters. Use hpmmiosize=64M and hpmmioprefsize=1G for it to
be noticeable, I reckon. This will answer questions I have about which
ports are hotplug bridges, among other things.

Okay, I added the following test cases:

e) Kernel Parameter ""
f) Kernel Parameter "pci=nocrs,realloc,hpmmiosize=64M,hpmmioprefsize=1G"

The logs are also included. Please let me know if I should do any
other tests and provide the logs.

5. There is a good chance it will not even boot with acpi=off on
kernels since around v5.3, but it is worth a shot there, also. Since a
recent kernel, I have found that acpi=off only removes HyperThreading,
and not all the physical cores like it used to. So there must have
been a patch which allowed it to guess the MADT table information. I
have not investigated. But now, some of my computers crash upon
loading the kernel with acpi=off. It must get it wrong at times.

Booting this 5.5 kernel with "acpi=off" increases the bootup time
quite a bit. The resources are distributed behind the PLX switch
(similar to using "pci=nocrs"), but again, accessing the BARs does not
work (0xffffffff is read back).
It was only to see if ACPI was part of the issue. You would not run in
production with it off.


What about
pci=noacpi instead?

I also tested using pci=noacpi and it did not resolve the resource
mapping problems.
Sorry if I missed something you said.

Best of luck, and I am interested into looking into this further. :)

Very much appreciated. :)

Thanks,
Stefan

[1] logs.tar.bz2
[2] 5.5.0-rc1-custom-test-c/dmesg.log

From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.

This looks tricky. Please revert my commit
c13704f5685deb7d6eb21e293233e0901ed77377 and see if it is the problem.

I reverted this patch and did a few tests (some of my test cases).
None turned out differently than before: either the resources are not
mapped completely, or they are mapped (with pci=nocrs) but not
accessible.

It is entirely possible, but because of the very old code and the
multiple passes it makes, it might be impossible to use realloc
without side effects for somebody. If you fix it for one scenario, it
is possible that another scenario will break due to the change. The
only way to make everything work is a near-complete rewrite of
drivers/pci/setup-bus.c and potentially other files, something I am
working on, but it is going to take a long time and is unlikely to
ever be accepted.

While working on this issue, I looked (again) at this resource
(re-)allocation code. It is really confusing (at least to me), and I
also think that it needs a "near complete rewrite".
Otherwise, it will take me a lot of grepping through dmesg to find the
cause, which will take more time.

Sure.
FYI, "lspci -vvv" is redundant because it can be produced from "lspci
-xxxx" output.

I know. It's mainly so I can quickly see the PCI devices listed.
A final note: Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in
the BIOS setup, although you will probably not get the hotplug
services provided by the PEX switch.

I think it should not matter for my current resource-assignment tests
how many PCIe lanes connect the PEX switch to the PCI root port. It's
of course important for the bandwidth, but that is a completely
different issue.

Thanks,
Stefan


