On Wed, May 10, 2023 at 4:48 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Tue, May 09, 2023 at 08:11:29PM -0400, Peter Geis wrote:
> > On Tue, May 9, 2023 at 5:19 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Tue, May 09, 2023 at 05:39:12PM +0200, Vincenzo Palazzo wrote:
> > > > Add a configurable delay to the Rockchip PCIe driver to address
> > > > crashes that occur on some old devices, such as the Pine64 RockPro64.
> > > >
> > > > This issue is affecting the ARM community, but there is no
> > > > upstream solution for it yet.
> > >
> > > It sounds like this happens with several endpoints, right? And I
> > > assume the endpoints work fine in other non-Rockchip systems? If
> > > that's the case, my guess is the problem is with the Rockchip host
> > > controller and how it's initialized, not with the endpoints.
> > > ...
>
> > The main issue with the rk3399 is the PCIe controller is buggy and
> > triggers a SoC panic when certain error conditions occur that should
> > be handled gracefully. One of those conditions is when an endpoint
> > requests an access to wait and retry later.
>
> I assume this refers to a Completion with Request Retry Status (RRS)?

I'm not sure the full coverage, the test patch from Shawn Lin that
allowed the system to handle the errors has the following description:
"Native defect prevents this RC far from supporting any response from
EP which UR filed is set."

>
> > Many years ago we ran that issue to ground and with Robin Murphy's
> > help we found that while it's possible to gracefully handle that
> > condition it required hijacking the entire arm64 error handling
> > routine. Not exactly scalable for just one SoC.
>
> Do you have a pointer to that discussion? The URL might save
> repeating the whole exercise and could be useful for the commit log
> when we try to resolve this.

The link to the patch email is here, the full discussion is pretty
easy to follow:
https://lore.kernel.org/linux-pci/2a381384-9d47-a7e2-679c-780950cd862d@xxxxxxxxxxxxxx/
Also:
https://lore.kernel.org/linux-rockchip/1f180d4b-9e5d-c829-555b-c9750940361e@xxxxxx/T/#m9c9d4a28a0d3aa064864f188b8ee3b16ce107aff

>
> > The configurable waits allow us to program reasonable times for
> > 90% of the endpoints that come up in the normal amount of time, while
> > being able to adjust it for the other 10% that do not. Some require
> > multiple seconds before they return without error. Part of the reason
> > we don't want to hardcode the wait time is because the probe isn't
> > handled asynchronously, so the kernel appears to hang while waiting
> > for the timeout.
>
> Is there some way for users to figure out that they would need this
> property? Or is it just "if your kernel panics on boot, try
> adding or increasing "bus-scan-delay-ms" in your DT?

There's a listing of tested cards at:
https://wiki.pine64.org/wiki/ROCKPro64_Hardware_compatibility

Most cards work fine that don't require a large BAR. PCIe switches are
completely dead without the above hack patch. Cards that lie in the
middle are ones that expect BIOS / EFI support to initialize, or ones
that have complex boot roms and don't initialize quickly.

But yes, it's unfortunately going to be "if you panic, increase the
delay" unless a more complete database of cards can be generated.

Very Respectfully,
Peter Geis

>
> Bjorn