On Fri, May 12, 2023 at 9:24 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> [+cc ARM64 folks, in case you have abort handling tips; thread at:
> https://lore.kernel.org/r/20230509153912.515218-1-vincenzopalazzodev@xxxxxxxxx]
>
> Pine64 RockPro64 panics while enumerating some PCIe devices. Adding a
> delay avoids the panic. My theory is a PCIe Request Retry Status to a
> Vendor ID config read causes an abort that we don't handle.
>
> > On Tue, May 09, 2023 at 05:39:12PM +0200, Vincenzo Palazzo wrote:
> >> ...
> >> [ 1.229856] SError Interrupt on CPU4, code 0xbf000002 -- SError
> >> [ 1.229860] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.9.9-2.0-MANJARO-ARM #1
> >> [ 1.229862] Hardware name: Pine64 RockPro64 v2.1 (DT)
> >> [ 1.229864] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
> >> [ 1.229866] pc : rockchip_pcie_rd_conf+0xb4/0x270
> >> [ 1.229868] lr : rockchip_pcie_rd_conf+0x1b4/0x270
> >> ...
> >> [ 1.229939] Kernel panic - not syncing: Asynchronous SError Interrupt
> >> ...
> >> [ 1.229955] nmi_panic+0x8c/0x90
> >> [ 1.229956] arm64_serror_panic+0x78/0x84
> >> [ 1.229958] do_serror+0x15c/0x160
> >> [ 1.229960] el1_error+0x84/0x100
> >> [ 1.229962] rockchip_pcie_rd_conf+0xb4/0x270
> >> [ 1.229964] pci_bus_read_config_dword+0x6c/0xd0
> >> [ 1.229966] pci_bus_generic_read_dev_vendor_id+0x34/0x1b0
> >> [ 1.229968] pci_scan_single_device+0xa4/0x144
>
> On Fri, May 12, 2023 at 12:46:21PM +0200, Vincenzo Palazzo wrote:
> > ... Is there any way to tell the kernel "hey we need some more time
> > here"?
>
> We enumerate PCI devices by trying to read the Vendor ID of every
> possible device address (see pci_scan_slot()). On PCIe, if a device
> doesn't exist at that address, the Vendor ID config read will be
> terminated with Unsupported Request (UR) status. This is normal
> and happens every time we enumerate devices.
>
> The crash doesn't happen every time we enumerate, so I don't think
> this UR is the problem. Also, if it *were* the problem, adding a
> delay would not make any difference.

Is this behavior different if there is a switch device forwarding the
UR? On rk3399, switches are completely non-functional because of the
panic, as can be seen in the dmesg output in [2] with the hack patch
applied. Considering what you just described, it looks like the
forwarded UR for each non-existent device behind the switch is causing
an SError.

>
> There *is* a way for a PCIe device to say "I need more time". It does
> this by responding to that Vendor ID config read with Request Retry
> Status (RRS, aka CRS in older specs), which means "I'm not ready yet,
> but I will be ready in the future." Adding a delay would definitely
> make a difference here, so my guess is this is what's happening.
>
> Most root complexes return ~0 data to the CPU when a config read
> terminates with UR or RRS. It sounds like rockchip does this for UR
> but possibly not for RRS.
>
> There is a "RRS Software Visibility" feature, which is supposed to
> turn the RRS into a special value (Vendor ID == 0x0001), but per [1],
> rockchip doesn't support it (lspci calls it "CRSVisible").
>
> But the CPU load instruction corresponding to the config read has to
> complete by reading *something* or else be aborted. It sounds like
> it's aborted in this case. I don't know the arm64 details, but if we
> could catch that abort and determine that it was an RRS and not a UR,
> maybe we could fabricate the magic RRS 0x0001 value.
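
For reference, the retry side of this already exists in the core:
pci_bus_generic_read_dev_vendor_id() (it is in the backtrace above)
treats a Vendor ID that reads back as 0x0001 in the low 16 bits as
"not ready yet" and polls with a backoff before giving up on the
device. Very roughly, paraphrasing from memory rather than quoting
drivers/pci/probe.c verbatim, the logic is:

#include <linux/delay.h>
#include <linux/pci.h>

/* Paraphrased sketch of the core's Vendor ID probe, not the real code. */
static bool read_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
			   int timeout)
{
	int delay = 1;

	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
		return false;

	/* UR on an empty slot normally comes back as all ones */
	if (*l == 0xffffffff || *l == 0x00000000)
		return false;

	/*
	 * RRS Software Visibility: Vendor ID == 0x0001 means "device
	 * exists but isn't ready yet", so wait and retry with a backoff
	 * instead of treating the slot as empty.
	 */
	while ((*l & 0xffff) == 0x0001) {
		if (delay > timeout)
			return false;
		msleep(delay);
		delay *= 2;
		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
			return false;
	}

	return true;
}

So if the abort could be caught and the aborted load made to return
0xffff0001, the core would keep retrying until the device answers,
which is presumably what the added delay papers over today.
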
>
> imx6q_pcie_abort_handler() does something like that, although I think
> it's for arm32, not arm64. But obviously we already catch the abort
> enough to dump the register state and panic, so maybe there's a way to
> extend that?

Perhaps a hook mechanism that allows drivers to register with the
SError handler and offer to handle specific errors before the generic
code panics the system? A rough (and purely hypothetical) sketch of the
kind of interface I have in mind is at the bottom of this mail.

Very Respectfully,
Peter Geis

[2] https://lore.kernel.org/linux-pci/CAMdYzYqn3L7x-vc+_K6jG0EVTiPGbz8pQ-N1Q1mRbcVXE822Yg@xxxxxxxxxxxxxx/

>
> Bjorn
>
> [1] https://lore.kernel.org/linux-pci/CAMdYzYpOFAVq30N+O2gOxXiRtpoHpakFg3LKq3TEZq4S6Y0y0g@xxxxxxxxxxxxxx/
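
Here is that sketch. The names are made up, nothing like this exists in
mainline today, and it hand-waves the hard part (matching an
asynchronous SError to the config access that triggered it) behind a
made-up per-controller flag:

#include <linux/list.h>
#include <linux/ptrace.h>

/* Hypothetical API, nothing like this exists today: */
struct serror_hook {
	struct list_head node;
	/* return true if the SError was handled and the panic can be skipped */
	bool (*handle)(struct pt_regs *regs, unsigned long esr, void *data);
	void *data;
};

void register_serror_hook(struct serror_hook *hook);
void unregister_serror_hook(struct serror_hook *hook);

/*
 * do_serror() would walk the registered hooks and only fall through to
 * arm64_serror_panic() if nobody claims the error.  The rockchip driver
 * could then flag config accesses while they are in flight (both fields
 * below are made up, they are not in struct rockchip_pcie today) and
 * consume the SError:
 */
static bool rockchip_pcie_handle_serror(struct pt_regs *regs,
					unsigned long esr, void *data)
{
	struct rockchip_pcie *rockchip = data;

	if (!READ_ONCE(rockchip->cfg_access_in_flight))
		return false;	/* not ours, let the generic code panic */

	/*
	 * Tell the config accessor the read was aborted, so it can
	 * fabricate the RRS special value (Vendor ID == 0x0001) and the
	 * PCI core will retry instead of giving up on the device.
	 */
	WRITE_ONCE(rockchip->cfg_access_aborted, true);
	return true;
}

rockchip_pcie_rd_conf() would then check that flag after the access and
hand 0xffff0001 back to the caller instead of whatever the aborted load
left behind.
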