[+cc Stanislav, Rajat] On Wed, Aug 28, 2024 at 06:17:04AM +0200, Lukas Wunner wrote: > On Tue, Aug 27, 2024 at 06:48:46PM -0500, Bjorn Helgaas wrote: > > @@ -1311,9 +1320,15 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout) > > return -ENOTTY; > > } > > > > - pci_read_config_dword(dev, PCI_COMMAND, &id); > > - if (!PCI_POSSIBLE_ERROR(id)) > > - break; > > + if (root && root->config_crs_sv) { > > + pci_read_config_dword(dev, PCI_VENDOR_ID, &id); > > + if (!pci_bus_crs_vendor_id(id)) > > + break; > > There was an effort from Amazon back in 2020/2021 to improve CRS support: > > https://lore.kernel.org/linux-pci/20200307172044.29645-1-stanspas@xxxxxxxxxx/ Thanks for reminding me of that, and I'm sorry that series didn't get applied back then because it's very similar to this one. > One suggestion you raised in the subsequent discussion was to use a > 16-bit (word) instead of a 32-bit (dword) read of the Vendor ID > register to avoid issues with devices that don't implement CRS SV > correctly: > > https://lore.kernel.org/linux-pci/20210913160745.GA1329939@bjorn-Precision-5520/ Thanks. Reading that response, I don't understand my point about using a 16-bit read. I mentioned 89665a6a7140 ("PCI: Check only the Vendor ID to identify Configuration Request Retry"), the commit log of which points to http://lkml.kernel.org/r/4729FC36.3040000@xxxxxxxxx, which documents several defective devices that have a Vendor ID of 0x0001. E.g., the Realtek rtl8169 controller mentioned there has Vendor/Device ID of [0001:8168]. Doing either a 16-bit read or a 32-bit read and checking the low 16 bits would incorrectly treat that as a Config RRS completion. So it *looks* to me like we will time out after 60 seconds and conclude the device never became ready: pci_scan_device(..., timeout=60*1000) pci_bus_read_dev_vendor_id pci_bus_generic_read_dev_vendor_id pci_bus_read_config_dword(PCI_VENDOR_ID, l) # <-- # *l == 0x81680001 if (pci_bus_crs_vendor_id(*l)) # 0x81680001 & 0xffff = 0x0001 pci_bus_wait_crs(..., timeout=60*1000) while (pci_bus_crs_vendor_id(*l)) { if (delay > timeout) return false; # device not ready pci_bus_read_config_dword(PCI_VENDOR_ID, l) } That *might* be an argument for doing a 32-bit read and checking for 0xffff0001, since the spec requires all 1's in the extra bytes. But I'm not aware of any complaints about these broken devices with a 0x0001 Vendor ID, and the 89665a6a7140 commit log says there are also defective devices that don't fill the extra bytes with all 1's. So my inclination is to keep the current code that does a 32-bit read and checks only the low 16 bits. > I realize that pci_bus_crs_vendor_id() masks the Device ID bits, > so probably no biggie. Just want to make sure all lessons learned > during previous discussions on this topic are considered. :)