On Thu, Nov 18, 2021 at 12:10 PM Stéphane Graber <stgraber@xxxxxxxxxx> wrote: > > Hello, > > I've recently been given access to a set of 4 APM X-Gene2 Merlin > boards (old-ish development platform). > Running them on Ubuntu 20.04's stock 5.4 kernel worked fine but trying > to run anything else would fail to boot due to a NVME initialization > timeout preventing the main drive from showing up at all. > > Tracking this issue, I first moved to clean mainline kernels and then > isolated the issue to be somewhere between 5.4.0 and 5.5.0-rc1, which > sadly meant the merge window (so much for a quick bisect...). I've > then bisected between those two points and came up with: > > 6dce5aa59e0bf2430733d7a8b11c205ec10f408e (refs/bisect/bad) PCI: > xgene: Use inbound resources for setup > > I finally switched to the latest 5.15.2 tree, reverted that one > commit, built a new kernel and confirmed that those boards now work > flawlessly. > > Unfortunately that's about the extent of my abilities with kernel > debugging and I won't pretend to understand what that commit does or > how it may be breaking PCIe initialization on those systems. > > I'm not technically blocked on this, I can manually build my own > kernels by reverting that one commit every time, but that's obviously > not ideal and I'd much rather have this fixed upstream :) Doesn't this platform have ACPI f/w you can use? From the log, it looks like ACPI tables are passed to the kernel, but since a full DT is passed it is used by default. Does 'acpi=on' not work? Given no one noticed the breakage for 2 years, I'd really like to remove these dts and binding files otherwise someone needs to convert bindings to schema and fix warnings. Current stats look like this: Processing apm: warnings: 240 undocumented compat: 114 For example, I noticed that dma-ranges declares the entries are 32-bit (0x42000000 is 32-bit prefetch), yet the PCI bus address and sizes are >32-bit. AFAICT, that isn't part of the problem here. > == Good boot on 5.15.2 (commit reverted) == > Full log at: https://gist.github.com/stgraber/e489b7e55dd7ffaac9f77dd8634ca2ff > > root@entak:~# dmesg | grep -Ei "nvme|pci" > [ 0.094146] PCI: CLS 0 bytes, default 64 > [ 0.130573] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4 > [ 0.131324] xgene-pcie 1f2b0000.pcie: host bridge /soc/pcie@1f2b0000 ranges: > [ 0.131344] xgene-pcie 1f2b0000.pcie: No bus range found for > /soc/pcie@1f2b0000, using [bus 00-ff] > [ 0.131365] xgene-pcie 1f2b0000.pcie: IO > 0xc010000000..0xc01000ffff -> 0x0000000000 > [ 0.131388] xgene-pcie 1f2b0000.pcie: MEM > 0xc120000000..0xc13fffffff -> 0x0020000000 > [ 0.131401] xgene-pcie 1f2b0000.pcie: MEM > 0xe000000000..0xffffffffff -> 0xe000000000 > [ 0.131416] xgene-pcie 1f2b0000.pcie: IB MEM > 0x8000000000..0x807fffffff -> 0x8000000000 > [ 0.131427] xgene-pcie 1f2b0000.pcie: IB MEM > 0x0000000000..0x7fffffffff -> 0x0000000000 My best guess is while the above is the parsed order of 'IB MEM' regions, we sort the entries by address now and that changes which inbound registers get used for each region. And one doesn't handle > 32-bit addresses. Can you try out this change? It's not what I want for a final change because the code is just as fragile: diff --git a/drivers/pci/controller/pci-xgene.c b/drivers/pci/controller/pci-xgene.c index 56d0d50338c8..18f05b65439e 100644 --- a/drivers/pci/controller/pci-xgene.c +++ b/drivers/pci/controller/pci-xgene.c @@ -465,16 +465,16 @@ static int xgene_pcie_select_ib_reg(u8 *ib_reg_mask, u64 size) return 1; } - if ((size > SZ_1K) && (size < SZ_1T) && !(*ib_reg_mask & (1 << 0))) { - *ib_reg_mask |= (1 << 0); - return 0; - } - if ((size > SZ_1M) && (size < SZ_1T) && !(*ib_reg_mask & (1 << 2))) { *ib_reg_mask |= (1 << 2); return 2; } + if ((size > SZ_1K) && (size < SZ_1T) && !(*ib_reg_mask & (1 << 0))) { + *ib_reg_mask |= (1 << 0); + return 0; + } + return -EINVAL; }