[+cc Rafael, linux-pci, linux-pm]

Sorry for the trouble this caused you, and thanks for the report.  I completely agree that these messages are not really useful to users.

After all your troubleshooting, were you able to do something to make the NVMe device usable?

You mention ACPI powering off the device between PCI enumeration and the driver's probe method.  Did you open a bug report about that, too?  I think we might need to explore that situation to resolve this.

The "config space inaccessible" message comes from pci_raw_set_power_state(), and it means we got ~0 when reading the Power Management Control/Status register.  That's not a valid value, so I assume the device was in D3cold, where it can't respond to config reads.

The fact that we enumerated the device means it was in at least D3hot, where it *can* respond to config reads.  PCI cannot put a device into D3cold directly; only ACPI or similar platform code can do that.

----- Forwarded message from bugzilla-daemon@xxxxxxxxxxxxxxxxxxx -----

Date: Tue, 10 Aug 2021 17:27:17 +0000
From: bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
To: bjorn@xxxxxxxxxxx
Subject: [Bug 214025] New: Better error message for PCI devices killed during boot?
Message-ID: <bug-214025-41252@xxxxxxxxxxxxxxxxxxxxxxxxx/>

https://bugzilla.kernel.org/show_bug.cgi?id=214025

            Bug ID: 214025
           Summary: Better error message for PCI devices killed during boot?
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.13.8
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: low
          Priority: P1
         Component: PCI
          Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
          Reporter: CFSworks@xxxxxxxxx
        Regression: No

Hello,

I recently finished troubleshooting an issue where an NVMe SSD on the PCIe bus wasn't being initialized by the driver; the kernel log contained:

  pci 0000:02:00.0: CLS mismatch (64 != 1020), using 64 bytes
  ...
  nvme 0000:02:00.0: can't change power state from D3hot to D0 (config space inaccessible)

The problem (which deserves its own bug report) was that ACPI initialization was powering off the device between the time the PCI bus was scanned and the time the driver probed the device.  The CLS value of 1020 came from the register being read as 0xFF (255 * 4 = 1020) because the config space was inaccessible.

However, to a user who doesn't have full intuition about PCI, neither of these messages is particularly clear about what's really happening.  I'd have expected a (WARN/ERR) message saying something more like:

  pci 0000:02:00.0: device has unexpectedly disappeared from the bus; removing

implemented either as a check right before driver probing or at key stages of the PCI device fixup process (such as when computing CLS).

This check is probably not necessary for hotplugged devices, since major platform power management initialization won't happen between the hotplug event and driver binding, but I strongly believe it's appropriate at boot, when other subsystems are liable to interfere with PCI devices.

An alternative to removing the device would be to keep it present in sysfs but put it in some other state (D3cold?) and hold off on trying to bind the driver.  This hopefully increases the chance that the user sees that the device is present but in an unusual state.

Thoughts?

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

----- End forwarded message -----
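
P.S. For reference, here is roughly where the "config space inaccessible" message comes from.  This is a condensed sketch of the relevant check in pci_raw_set_power_state() (drivers/pci/pci.c, around v5.13), not a verbatim copy of the source:

  static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
  {
          u16 pmcsr;
          ...
          /*
           * Read the Power Management Control/Status register.  A device
           * in D3cold (or one that has disappeared from the bus) can't
           * respond to config reads, so the read comes back as all ones.
           */
          pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
          if (pmcsr == (u16) ~0) {
                  pci_err(dev, "can't change power state from %s to %s (config space inaccessible)\n",
                          pci_power_name(dev->current_state),
                          pci_power_name(state));
                  return -EIO;
          }
          ...
  }

An all-ones config read is also what produced the bogus CLS value in the earlier message: the cacheline size register read back as 0xFF, and 0xFF * 4 = 1020.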