> On Tue, May 9, 2023 at 5:19 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >
> > Hi Vincenzo,
> >
> > Thanks for raising this issue. Let's see what we can do to address
> > it.
> >
> > On Tue, May 09, 2023 at 05:39:12PM +0200, Vincenzo Palazzo wrote:
> > > Add a configurable delay to the Rockchip PCIe driver to address
> > > crashes that occur on some old devices, such as the Pine64 RockPro64.
> > >
> > > This issue is affecting the ARM community, but there is no
> > > upstream solution for it yet.
> >
> > It sounds like this happens with several endpoints, right? And I
> > assume the endpoints work fine in other non-Rockchip systems? If
> > that's the case, my guess is the problem is with the Rockchip host
> > controller and how it's initialized, not with the endpoints.
> >
> > The only delays and timeouts I see in the driver now are in
> > rockchip_pcie_host_init_port(), where it waits for link training to
> > complete. I assume the link training did complete successfully
> > since you don't mention either a gen1 or gen2 timeout (although the
> > gen2 message is a dev_dbg() that normally wouldn't go to the console).
> >
> > I don't know that the spec contains a retrain timeout value. Several
> > other drivers use 1 second, while rockchip uses 500ms (for example,
> > see LINK_RETRAIN_TIMEOUT and LINK_UP_TIMEOUT).
> >
> > I think we need to understand the issue better before adding a DT
> > property and a module parameter. Those are hard for users to deal
> > with. If we can figure out a value that works for everybody, it would
> > be better to just hard-code it in the driver and use that all the
> > time.
>
> Good Evening,
>
> The main issue with the rk3399 is that the PCIe controller is buggy
> and triggers a SoC panic when certain error conditions occur that
> should be handled gracefully. One of those conditions is when an
> endpoint requests that an access wait and be retried later. Many
> years ago we ran that issue to ground, and with Robin Murphy's help
> we found that while it's possible to handle that condition
> gracefully, it required hijacking the entire arm64 error handling
> routine. Not exactly scalable for just one SoC. The configurable
> waits allow us to program reasonable times for the 90% of endpoints
> that come up in the normal amount of time, while being able to
> adjust it for the other 10% that do not. Some require multiple
> seconds before they return without error. Part of the reason we
> don't want to hardcode the wait time is that the probe isn't handled
> asynchronously, so the kernel appears to hang while waiting for the
> timeout.

Yeah, that smells like a hardware bug that my code is papering over. I
hate waiting like this inside the code; needing to do something like
that is usually a sign that something else is wrong. During my research
I also found this patch (https://bugzilla.redhat.com/show_bug.cgi?id=2134177),
which fixes the problem in another, possibly cleaner way. But I don't
understand the reasoning behind it yet, so maybe I haven't spent enough
time thinking about it.

> I'm curious if it's been tested with this series on top:
> https://lore.kernel.org/linux-arm-kernel/20230418074700.1083505-8-rick.wertenbroek@xxxxxxxxx/T/
> I'm particularly curious if
> [v5,04/11] PCI: rockchip: Add poll and timeout to wait for PHY PLLs to be locked
> makes a difference in the behavior. Please test this and see if it
> improves the timeouts you need for the endpoints you're testing
> against.

Mh, I can try to cherry-pick that commit and test it in my own test
environment.
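For context, as far as I understand it, what that patch adds is the
usual poll-with-timeout pattern on the PHY PLL lock status, instead of
assuming the PLLs lock immediately. A minimal sketch of what I mean,
where the register offset and bit (PCIE_PHY_PLL_STATUS,
PCIE_PHY_PLL_LOCKED) are made-up placeholders and not the real RK3399
layout:

#include <linux/bits.h>
#include <linux/device.h>
#include <linux/iopoll.h>

/* Placeholder offset/bit -- not the real RK3399 register map. */
#define PCIE_PHY_PLL_STATUS	0x2c
#define PCIE_PHY_PLL_LOCKED	BIT(0)

static int wait_for_phy_pll_lock(struct device *dev, void __iomem *apb_base)
{
	u32 status;
	int err;

	/* Poll every 20us; give up after 1s instead of waiting forever. */
	err = readl_poll_timeout(apb_base + PCIE_PHY_PLL_STATUS, status,
				 status & PCIE_PHY_PLL_LOCKED,
				 20, 1000 * 1000);
	if (err)
		dev_err(dev, "PHY PLLs did not lock: %d\n", err);

	return err;
}

If the PLLs really do take a variable amount of time to lock on these
boards, a bounded poll like that would be a much nicer fix than a
blind, user-configurable delay.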
Currently I haven't been able to test it due to a lack of hardware, but
I'm looking for a way to obtain some. Luckily, there is someone on the
Manjaro ARM team who can help me test it, so I'll try to do that.

Cheers!

Vincent.