On Fri, Oct 11, 2024 at 09:59:12AM -0400, Gregory Price wrote:
> On Thu, Oct 10, 2024 at 05:16:28PM -0500, Bjorn Helgaas wrote:
> > On Fri, Oct 04, 2024 at 12:28:28PM -0400, Gregory Price wrote:
> > > During initial device probe, the PCI DOE busy bit for some CXL
> > > devices may be left set for a longer period than expected by the
> > > current driver logic. Despite local comments stating DOE Busy is
> > > unlikely to be detected, it appears commonly specifically during
> > > boot when CXL devices are being probed.
> > >
> > > This was observed on a single socket AMD platform with 2 CXL memory
> > > expanders attached to the single socket. It was not the case that
> > > concurrent accesses were being made, as validated by monitoring
> > > mailbox commands on the device side.
> > >
> > > This behavior has been observed with multiple CXL memory expanders
> > > from different vendors - so it appears unrelated to the model.
> > >
> > > In all observed tests, only a small period of the retry window is
> > > actually used - typically only a handful of loop iterations.
> > >
> > > Polling on the PCI DOE Busy Bit for (at max) one PCI DOE timeout
> > > interval (1 second), resolves this issue cleanly.
> > >
> > > Per PCIe r6.2 sec 6.30.3, the DOE Busy Bit being cleared does not
> > > raise an interrupt, so polling is the best option in this scenario.
> > >
> > > Subsequent code in doe_statemachine_work and abort paths also wait
> > > for up to 1 PCI DOE timeout interval, so this order of (potential)
> > > additional delay is presumed acceptable.
> >
> > I provisionally applied this to pci/doe for v6.13 with Lukas and
> > Jonathan's reviewed-by.
> >
> > Can we include a sample of any dmesg logging or other errors users
> > would see because of this problem? I'll update the commit log with
> > any of this information to help users connect an issue with this fix.
> >
>
> The only indication in dmesg you will see is a line like
>
> [ 24.542625] endpoint6: DOE failed -EBUSY
>
> produced by cxl_cdat_get_length or cxl_cdat_read_table
>
> Do you want an updated patch with the nits fixed?

No need, I fixed the nits and added the dmesg line to the commit log.
Thank you!

Bjorn