On Thu, Nov 18, 2021 at 5:03 PM Rob Herring <robh@xxxxxxxxxx> wrote: > > On Thu, Nov 18, 2021 at 3:20 PM Rob Herring <robh@xxxxxxxxxx> wrote: > > > > On Thu, Nov 18, 2021 at 12:10 PM Stéphane Graber <stgraber@xxxxxxxxxx> wrote: > > > > > > Hello, > > > > > > I've recently been given access to a set of 4 APM X-Gene2 Merlin > > > boards (old-ish development platform). > > > Running them on Ubuntu 20.04's stock 5.4 kernel worked fine but trying > > > to run anything else would fail to boot due to a NVME initialization > > > timeout preventing the main drive from showing up at all. > > > > > > Tracking this issue, I first moved to clean mainline kernels and then > > > isolated the issue to be somewhere between 5.4.0 and 5.5.0-rc1, which > > > sadly meant the merge window (so much for a quick bisect...). I've > > > then bisected between those two points and came up with: > > > > > > 6dce5aa59e0bf2430733d7a8b11c205ec10f408e (refs/bisect/bad) PCI: > > > xgene: Use inbound resources for setup > > > > > > I finally switched to the latest 5.15.2 tree, reverted that one > > > commit, built a new kernel and confirmed that those boards now work > > > flawlessly. > > > > > > Unfortunately that's about the extent of my abilities with kernel > > > debugging and I won't pretend to understand what that commit does or > > > how it may be breaking PCIe initialization on those systems. > > > > > > I'm not technically blocked on this, I can manually build my own > > > kernels by reverting that one commit every time, but that's obviously > > > not ideal and I'd much rather have this fixed upstream :) > > > > Doesn't this platform have ACPI f/w you can use? From the log, it > > looks like ACPI tables are passed to the kernel, but since a full DT > > is passed it is used by default. Does 'acpi=on' not work? Gave that a try with a clean 5.15.2 and unfortunately it's not booting at all, all I get is: Loading Linux 5.15.2 ... Loading initial ramdisk ... EFI stub: Booting Linux Kernel... EFI stub: EFI_RNG_PROTOCOL unavailable EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary EFI stub: ERROR: FIRMWARE BUG: Image BSS overlaps adjacent EFI memory region EFI stub: Using DTB from configuration table EFI stub: Exiting boot services... L3C: 8MB > > Given no one noticed the breakage for 2 years, I'd really like to > > remove these dts and binding files otherwise someone needs to convert > > bindings to schema and fix warnings. Current stats look like this: > > Processing apm: > > warnings: 240 > > undocumented compat: 114 > > > > For example, I noticed that dma-ranges declares the entries are 32-bit > > (0x42000000 is 32-bit prefetch), yet the PCI bus address and sizes are > > >32-bit. AFAICT, that isn't part of the problem here. > > > > > == Good boot on 5.15.2 (commit reverted) == > > > Full log at: https://gist.github.com/stgraber/e489b7e55dd7ffaac9f77dd8634ca2ff > > > > > > root@entak:~# dmesg | grep -Ei "nvme|pci" > > > [ 0.094146] PCI: CLS 0 bytes, default 64 > > > [ 0.130573] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4 > > > [ 0.131324] xgene-pcie 1f2b0000.pcie: host bridge /soc/pcie@1f2b0000 ranges: > > > [ 0.131344] xgene-pcie 1f2b0000.pcie: No bus range found for > > > /soc/pcie@1f2b0000, using [bus 00-ff] > > > [ 0.131365] xgene-pcie 1f2b0000.pcie: IO > > > 0xc010000000..0xc01000ffff -> 0x0000000000 > > > [ 0.131388] xgene-pcie 1f2b0000.pcie: MEM > > > 0xc120000000..0xc13fffffff -> 0x0020000000 > > > [ 0.131401] xgene-pcie 1f2b0000.pcie: MEM > > > 0xe000000000..0xffffffffff -> 0xe000000000 > > > [ 0.131416] xgene-pcie 1f2b0000.pcie: IB MEM > > > 0x8000000000..0x807fffffff -> 0x8000000000 > > > [ 0.131427] xgene-pcie 1f2b0000.pcie: IB MEM > > > 0x0000000000..0x7fffffffff -> 0x0000000000 > > > > My best guess is while the above is the parsed order of 'IB MEM' > > regions, we sort the entries by address now and that changes which > > inbound registers get used for each region. And one doesn't handle > > > 32-bit addresses. Can you try out this change? It's not what I want > > for a final change because the code is just as fragile: > > Actually, a better change is this: > > diff --git a/drivers/pci/controller/pci-xgene.c > b/drivers/pci/controller/pci-xgene.c > index 56d0d50338c8..d83dbd977418 100644 > --- a/drivers/pci/controller/pci-xgene.c > +++ b/drivers/pci/controller/pci-xgene.c > @@ -465,7 +465,7 @@ static int xgene_pcie_select_ib_reg(u8 > *ib_reg_mask, u64 size) > return 1; > } > > - if ((size > SZ_1K) && (size < SZ_1T) && !(*ib_reg_mask & (1 << 0))) { > + if ((size > SZ_1K) && (size < SZ_4G) && !(*ib_reg_mask & (1 << 0))) { > *ib_reg_mask |= (1 << 0); > return 0; > } Just tested it, and it booted just fine! Full boot log: https://gist.github.com/stgraber/41b2419ef88611ab7a2b4dceb028b4f7 Stéphane