On Friday 18 February 2022 02:53:39 Marek Vasut wrote:
> On 2/17/22 14:04, Pali Rohár wrote:
>
> [...]
>
> > > > > > Flipping either bit makes no difference, suspend/resume behaves the same and
> > > > > > the link always recovers.
> > > > >
> > > > > Ok, perfect! And what happens without suspend/resume (just in normal
> > > > > conditions)? E.g. during active usage of some PCIe card (wifi, sata, etc..).
> > > >
> > > > PING? Also what lspci see for the root port and card itself during hot reset?
> > >
> > > If I recall, lspci showed the root port and card.
> >
> > This is suspicious. Card should not respond to config read requests when
> > is in hot reset state. Could you send output of lspci -vvxx of the root
> > port and also card during this test? Maybe it is possible that root port
> > has broken BRIDGE_CONTROL register and did not put card into Hot Reset
> > state?
>
> Yes, I could set the hardware up again and run more tests, it will take some
> time, but I can still do that.
>
> But before I spend any more time running tests for you here, I have to
> admit, it seems to me running all those tests is completely off-topic in
> context of these two bugfixes here.

I do not think this is off-topic. The issue here occurs when the controller
is not in L0 state, and this test deterministically puts the controller
into a non-L0 state (Hot Reset). The best verification of race conditions
and similar timing problems is to set up a scenario in which the timing
windows are under full control, which this test can do.

I saw more issues related to PCIe slave errors and I have the feeling that
this patch is just hacking around one or two consequences and not fixing
the source of the problem globally. In most cases slave errors are
(incorrectly) reported to the CPU when the PCIe controller receives a UR/CA
response from the bus, or when the controller itself generates a UR/CA
response for a request from the CPU.

> So maybe it would make sense to stop the discussion here and move it to
> separate thread ?
>
> I have to admit, I also don't quite understand what it is that you're trying
> to find out with all those tests.

Moreover, if this test shows that the PCI Bridge registers do not work
properly, then that is something which must be fixed too.

There were more discussions about catching and recovering from ARM CPU
aborts, and all patches for catching asynchronous exceptions were rejected
because they cannot work by their _imprecise_ nature. There were also
discussions (not sure if on ML or IRC) about whether the PCI core / drivers
are the correct place for handling ARMv7 exceptions / data aborts.
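
To make the lspci check requested above concrete, here is a rough sketch of
what could be looked for in the raw config space of the root port and of the
card while the Hot Reset test is running. It only assumes the standard
register layout: Bridge Control at offset 0x3E of a Type 1 header, with
Secondary Bus Reset as bit 6, and an all-ones Vendor ID when nothing answers
the config read. The function names and the sysfs paths in the usage comment
are made up for illustration and are not taken from any driver or tool.

```python
import struct

PCI_VENDOR_ID = 0x00               # 16-bit register, little-endian
PCI_BRIDGE_CONTROL = 0x3E          # 16-bit register, Type 1 (bridge) header
PCI_BRIDGE_CTL_BUS_RESET = 0x0040  # Secondary Bus Reset bit (bit 6)


def read16(cfg: bytes, offset: int) -> int:
    """Read a little-endian 16-bit register from a raw config-space dump."""
    return struct.unpack_from("<H", cfg, offset)[0]


def secondary_bus_reset_set(bridge_cfg: bytes) -> bool:
    """True if the bridge reports Secondary Bus Reset as asserted."""
    return bool(read16(bridge_cfg, PCI_BRIDGE_CONTROL) & PCI_BRIDGE_CTL_BUS_RESET)


def device_responds(dev_cfg: bytes) -> bool:
    """True if the Vendor ID is not all-ones, i.e. something answered."""
    return read16(dev_cfg, PCI_VENDOR_ID) != 0xFFFF


def check_hot_reset(bridge_cfg: bytes, dev_cfg: bytes) -> str:
    """Classify what the two dumps say about the Hot Reset state."""
    if not secondary_bus_reset_set(bridge_cfg):
        return "bridge: Secondary Bus Reset bit is not set"
    if device_responds(dev_cfg):
        return "suspicious: reset bit set but card still answers config reads"
    return "ok: reset bit set and card does not answer"


# Hypothetical usage on Linux (the device addresses are placeholders):
#   bridge = open("/sys/bus/pci/devices/0000:00:00.0/config", "rb").read(64)
#   card = open("/sys/bus/pci/devices/0000:01:00.0/config", "rb").read(64)
#   print(check_hot_reset(bridge, card))
```

The "suspicious" result corresponds to what was reported earlier in this
thread (lspci still showing the card). Note that on a platform with the slave
error problem discussed here, reading the card's config space during Hot
Reset may itself trigger an abort, so this is only a sketch of the check and
not something safe to run everywhere.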