[bugzilla-daemon@xxxxxxxxxxxxxxxxxxx: [Bug 214025] New: Better error message for PCI devices killed during boot?]

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Tue, 10 Aug 2021 16:47:40 -0500

[+cc Rafael, linux-pci, linux-pm]

Sorry for the trouble this caused you, and thanks for the report.

I completely agree that these messages are not really useful to users.
After all your troubleshooting, were you able to do something to make
the NVMe device usable?

You mention ACPI powering off the device between PCI enumeration and
the driver's probe method.  Did you open a bug report about that, too?
I think we might need to explore that situation to resolve this.

The "config space inaccessible" message comes from
pci_raw_set_power_state(), and it means we got ~0 when reading the
Power Management Control/Status register.  That's not a valid value,
so I assume the device was in D3cold, where it can't respond to config
reads.
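
As a rough illustration of that check (a Python sketch, not the kernel
code; decode_pmcsr is a made-up name, and only the 0x0003 state mask
matches the kernel's PCI_PM_CTRL_STATE_MASK), a 16-bit read of all ones
is treated as "no response" instead of being decoded as a power state:

```python
PCI_PM_CTRL_STATE_MASK = 0x0003  # low two bits of PMCSR encode D0..D3hot


def decode_pmcsr(pmcsr: int) -> str:
    """Interpret a 16-bit PMCSR value as read from config space."""
    pmcsr &= 0xFFFF
    # A config read that gets no response comes back as all ones,
    # which cannot be a real PMCSR value.
    if pmcsr == 0xFFFF:
        return "inaccessible"
    return ("D0", "D1", "D2", "D3hot")[pmcsr & PCI_PM_CTRL_STATE_MASK]
```

With this sketch, decode_pmcsr(0xFFFF) reports "inaccessible", which is
the case behind the message above.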

The fact that we enumerated the device means it was in D3hot or a
shallower state (D0-D2) at that time, where it *can* respond to config
reads.  PCI cannot put a device into D3cold directly; only ACPI or
similar platform code can do that.

----- Forwarded message from bugzilla-daemon@xxxxxxxxxxxxxxxxxxx -----

Date: Tue, 10 Aug 2021 17:27:17 +0000
From: bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
To: bjorn@xxxxxxxxxxx
Subject: [Bug 214025] New: Better error message for PCI devices killed during
	boot?
Message-ID: <bug-214025-41252@xxxxxxxxxxxxxxxxxxxxxxxxx/>

https://bugzilla.kernel.org/show_bug.cgi?id=214025

            Bug ID: 214025
           Summary: Better error message for PCI devices killed during
                    boot?
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.13.8
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: low
          Priority: P1
         Component: PCI
          Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
          Reporter: CFSworks@xxxxxxxxx
        Regression: No

Hello,

I recently finished troubleshooting an issue where an NVMe SSD on the PCIe
bus wasn't being initialized by the driver; the kernel log contained:

pci 0000:02:00.0: CLS mismatch (64 != 1020), using 64 bytes
...
nvme 0000:02:00.0: can't change power state from D3hot to D0 (config space inaccessible)

The problem (which deserves its own bug report) was that ACPI initialization
was powering off the device between the time the PCI bus was scanned and the
time the driver was probing the device. The CLS value of 1020 came from the
register being read as 0xFF (255*4 = 1020) due to the config space being
inaccessible. However, to a user who doesn't have full intuition about PCI,
neither of these messages is particularly clear about what's really happening.
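
To illustrate the arithmetic: the Cache Line Size register is a single byte
that counts 4-byte words, so the size in bytes is the raw value times 4. A
minimal Python sketch (cls_bytes is a hypothetical helper, not a kernel
function):

```python
def cls_bytes(raw: int) -> int:
    """Convert a raw Cache Line Size register byte to a size in bytes."""
    # The register is 8 bits wide and is expressed in 4-byte units.
    return (raw & 0xFF) * 4


# 0x10 corresponds to the 64 bytes in the log message; an all-ones read
# (0xFF) from an inaccessible device corresponds to the bogus 1020.
print(cls_bytes(0x10), cls_bytes(0xFF))
```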

I'd have expected a (WARN/ERR) message saying something more like, "pci
0000:02:00.0: device has unexpectedly disappeared from the bus; removing"
implemented either as a check right before driver probing or at key stages of
the PCI device fixup process (such as when computing CLS). This check is
probably not necessary for hotplugged devices, since major platform power
management initialization won't happen between the hotplug event and driver
binding, but I strongly believe it's appropriate at boot when other subsystems
are liable to interfere with PCI devices.

An alternative to removing the device would be to keep it present in sysfs but
put it in some other state (D3cold?) and hold off on trying to bind the driver.
This hopefully increases the chance that the user sees that the device is
present but in an unusual state.

Thoughts?

----- End forwarded message -----