Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Mon, 24 Jan 2022 15:46:35 -0600

[+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]

On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215525
> 
>             Bug ID: 215525
>            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.17.0-rc1 upstream
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: PCI
>           Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
>           Reporter: blazej.kucman@xxxxxxxxx
>         Regression: No
> 
> Created attachment 300308
>   --> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
> dmesg
> 
> While testing on latest upstream
> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/) we
> noticed that with the merge commit
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
> hotplug and hotunplug of nvme drives stopped working.
> 
> Rescan PCI does not help.
> echo "1" > /sys/bus/pci/rescan
> 
> Issue does not reproduce on a kernel built on an antecedent
> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
> 
> 
> During hot-remove the device does not disappear; however, when we try to do
> I/O on the disk there is an I/O error, and then the device disappears.
> 
> Before I/O no logs regarding the disk appeared in the dmesg, only after I/O the
> entries appeared like below:
> [  177.943703] nvme nvme5: controller is down; will reset: CSTS=0xffffffff,
> PCI_STATUS=0xffff
> [  177.971661] nvme 10000:0b:00.0: can't change power state from D3cold to D0
> (config space inaccessible)
> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI INT A
> [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
> [  177.992633] nvme nvme5: Removing after probe failure status: -19
> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags 0x0
> phys_seg 1 prio class 0
> 
> 
> OS: RHEL 8.4 GA
> Platform: Intel Purley
> 
> The logs were collected on an older upstream kernel, but the issue also occurs
> on the newest upstream kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)

Apparently hotplug worked immediately before the PCI changes for
v5.17 were merged and failed immediately after:

  good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
  bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")

Only three commits in that merge touch pciehp:

  085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
  23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
  a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
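
A list like the one above could be regenerated roughly as follows.  This
is only a sketch: it assumes the current directory is a clone of Linus's
tree containing both merges, and merge_range is a helper name made up
for illustration, not anything from the kernel tree.

```shell
#!/bin/sh
# Abbreviated commit IDs taken from the good/bad pair above.
GOOD=88db8458086b
BAD=d0a231f01e5b

# Hypothetical helper: print the revision range between the two merges.
merge_range() {
    printf '%s..%s' "$GOOD" "$BAD"
}

# Only run git log when we are inside a repository that has the bad commit;
# otherwise do nothing (git missing or wrong directory).
if git rev-parse --verify --quiet "$BAD^{commit}" >/dev/null 2>&1; then
    # Limit the log to files whose names start with pciehp.
    git log --oneline "$(merge_range)" -- 'drivers/pci/hotplug/pciehp*'
fi
```

The same good/bad pair could also be handed to "git bisect start
d0a231f01e5b 88db8458086b" if the debug output does not narrow things
down.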

None seems obviously related to me.  Blazej, could you try setting
CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
enable more debug messages?
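
If rebooting with the parameter is inconvenient, the same query can be
applied at runtime.  A minimal sketch, assuming debugfs is mounted at
/sys/kernel/debug (the usual place, but an assumption about this
system) and that it is run as root; pciehp_dyndbg_query is a made-up
helper name:

```shell
#!/bin/sh
# Hypothetical helper: the dynamic debug query that enables all pr_debug/
# dev_dbg call sites in source files matching pciehp*.
pciehp_dyndbg_query() {
    printf 'file pciehp* +p'
}

# Assumed debugfs mount point; adjust if debugfs is mounted elsewhere.
CONTROL=/sys/kernel/debug/dynamic_debug/control

# Write the query only if the control file exists and is writable
# (requires CONFIG_DYNAMIC_DEBUG=y and root).
if [ -w "$CONTROL" ]; then
    pciehp_dyndbg_query > "$CONTROL"
    # Show a few of the pciehp call sites and their current flags.
    grep pciehp "$CONTROL" | head
fi
```

Enabling it on the kernel command line is still preferable here, since
that also captures the messages printed while the ports are probed.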

Bjorn