[+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]

On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215525
>
>             Bug ID: 215525
>            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.17.0-rc1 upstream
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: PCI
>           Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
>           Reporter: blazej.kucman@xxxxxxxxx
>         Regression: No
>
> Created attachment 300308
>   --> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
> dmesg
>
> While testing on latest upstream
> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/) we
> noticed that with the merge commit
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
> hotplug and hotunplug of nvme drives stopped working.
>
> Rescan PCI does not help.
> echo "1" > /sys/bus/pci/rescan
>
> Issue does not reproduce on a kernel built on an antecedent
> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
>
>
> During hot-remove device does not disappear, however when we try to do I/O on
> the disk then there is an I/O error, and the device disappears.
>
> Before I/O no logs regarding the disk appeared in the dmesg, only after I/O the
> entries appeared like below:
> [  177.943703] nvme nvme5: controller is down; will reset: CSTS=0xffffffff,
> PCI_STATUS=0xffff
> [  177.971661] nvme 10000:0b:00.0: can't change power state from D3cold to D0
> (config space inaccessible)
> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI INT A
> [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
> [  177.992633] nvme nvme5: Removing after probe failure status: -19
> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags 0x0
> phys_seg 1 prio class 0
>
>
> OS: RHEL 8.4 GA
> Platform: Intel Purley
>
> The logs are collected on a non-recent upstream kernel, but a issue also occurs
> on the newest upstream kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)

Apparently worked immediately before merging the PCI changes for v5.17
and failed immediately after:

  good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
  bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")

Only three commits touch pciehp:

  085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
  23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
  a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")

None seems obviously related to me.  Blazej, could you try setting
CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
enable more debug messages?

Bjorn
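
For reference, the same pciehp debug messages can probably also be turned on at runtime, without a reboot. What follows is only a minimal sketch, assuming the kernel was built with CONFIG_DYNAMIC_DEBUG=y and that debugfs is mounted at /sys/kernel/debug. The helper names pciehp_query and enable_pciehp_debug are made up for illustration; the query string is the one from the dyndbg= parameter above.

```shell
#!/bin/sh
# Sketch: enable pciehp debug messages at runtime via dynamic debug.
# Assumptions (not from the report): CONFIG_DYNAMIC_DEBUG=y and debugfs
# mounted at /sys/kernel/debug. Helper names are illustrative only.

CONTROL=/sys/kernel/debug/dynamic_debug/control

# Print the dynamic debug query; same text as in the dyndbg= boot parameter.
pciehp_query() {
    printf '%s\n' 'file pciehp* +p'
}

# Write the query to the control file; fail with a hint if that is not possible.
enable_pciehp_debug() {
    if [ ! -w "$CONTROL" ]; then
        echo "cannot write $CONTROL (root? debugfs? CONFIG_DYNAMIC_DEBUG?)" >&2
        return 1
    fi
    pciehp_query > "$CONTROL"
}
```

Calling enable_pciehp_debug as root, repeating the hot-remove, and then collecting dmesg should show the additional pciehp output. Unlike the boot parameter, this does not capture messages printed while pciehp probes the ports during boot, so the dyndbg= route is still the more complete one.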