Hi,

On 1/24/22 22:46, Bjorn Helgaas wrote:
> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
>
> On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215525
>>
>> Bug ID: 215525
>> Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>> Product: Drivers
>> Version: 2.5
>> Kernel Version: 5.17.0-rc1 upstream
>> Hardware: x86-64
>> OS: Linux
>> Tree: Mainline
>> Status: NEW
>> Severity: normal
>> Priority: P1
>> Component: PCI
>> Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
>> Reporter: blazej.kucman@xxxxxxxxx
>> Regression: No
>>
>> Created attachment 300308
>> --> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
>> dmesg
>>
>> While testing on latest upstream
>> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/) we
>> noticed that with the merge commit
>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
>> hotplug and hotunplug of nvme drives stopped working.
>>
>> Rescan PCI does not help.
>> echo "1" > /sys/bus/pci/rescan
>>
>> Issue does not reproduce on a kernel built on an antecedent
>> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
>>
>>
>> During hot-remove device does not disappear, however when we try to do I/O on
>> the disk then there is an I/O error, and the device disappears.
>>
>> Before I/O no logs regarding the disk appeared in the dmesg, only after I/O the
>> entries appeared like below:
>> [ 177.943703] nvme nvme5: controller is down; will reset: CSTS=0xffffffff,
>> PCI_STATUS=0xffff
>> [ 177.971661] nvme 10000:0b:00.0: can't change power state from D3cold to D0
>> (config space inaccessible)
>> [ 177.981121] pcieport 10000:00:02.0: can't derive routing for PCI INT A
>> [ 177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
>> [ 177.992633] nvme nvme5: Removing after probe failure status: -19
>> [ 178.004633] nvme5n1: detected capacity change from 83984375 to 0
>> [ 178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags 0x0
>> phys_seg 1 prio class 0
>>
>>
>> OS: RHEL 8.4 GA
>> Platform: Intel Purley
>>
>> The logs are collected on a non-recent upstream kernel, but a issue also occurs
>> on the newest upstream kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)
>
> Apparently worked immediately before merging the PCI changes for
> v5.17 and failed immediately after:
>
>   good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>   bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
>
> Only three commits touch pciehp:
>
>   085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>   23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>   a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
>
> None seems obviously related to me. Blazej, could you try setting
> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
> enable more debug messages?

Since there are only 3 commits maybe try reverting them 1 by 1 in reverse
history order (so revert latest commit first)? And see if running a kernel
with the reverted commit(s) fixes things?

Regards,

Hans