Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1

On Mon, 24 Jan 2022 15:46:35 -0600
Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:

> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
> 
> On Mon, Jan 24, 2022 at 11:46:14AM +0000,
> bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=215525
> > 
> >             Bug ID: 215525
> >            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
> >            Product: Drivers
> >            Version: 2.5
> >     Kernel Version: 5.17.0-rc1 upstream
> >           Hardware: x86-64
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: PCI
> >           Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
> >           Reporter: blazej.kucman@xxxxxxxxx
> >         Regression: No
> > 
> > Created attachment 300308  
> >   -->
> > https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
> > dmesg
> > 
> > While testing on latest upstream
> > kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/)
> > we noticed that with the merge commit
> > (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
> > hotplug and hotunplug of nvme drives stopped working.
> > 
> > Rescan PCI does not help.
> > echo "1" > /sys/bus/pci/rescan
> > 
> > Issue does not reproduce on a kernel built on an antecedent
> > commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
> > 
> > 
> > During hot-remove device does not disappear, however when we try to
> > do I/O on the disk then there is an I/O error, and the device
> > disappears.
> > 
> > Before I/O no logs regarding the disk appeared in the dmesg, only
> > after I/O the entries appeared like below:
> > [  177.943703] nvme nvme5: controller is down; will reset:
> > CSTS=0xffffffff, PCI_STATUS=0xffff
> > [  177.971661] nvme 10000:0b:00.0: can't change power state from
> > D3cold to D0 (config space inaccessible)
> > [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI
> > INT A
> > [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
> > [  177.992633] nvme nvme5: Removing after probe failure status: -19
> > [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
> > [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags
> > 0x0 phys_seg 1 prio class 0
> > 
> > 
> > OS: RHEL 8.4 GA
> > Platform: Intel Purley
> > 
> > The logs were collected on a non-recent upstream kernel, but the
> > issue also occurs on the newest upstream
> > kernel (dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)
> 
> Apparently worked immediately before merging the PCI changes for
> v5.17 and failed immediately after:
> 
>   good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of
>         git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>   bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of
>         git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
> 
> Only three commits touch pciehp:
> 
>   085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock)
>                 to fix lockdep errors")
>   23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon
>                 power fault")
>   a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check
>                 config reads")
> 
> None seems obviously related to me.  Blazej, could you try setting
> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
> enable more debug messages?
> 

Hi Bjorn,

Thanks for your suggestions. Blazej ran some tests and the results were
inconclusive. He tested on two identical platforms. On the first one
hotplug did not work, even after he reverted all the suggested patches.
On the second one hotplug always worked.

He noticed that the first platform, where the issue was initially found,
was booted with the parameter "pci=nommconf". After adding this
parameter on the second platform, hotplug stopped working there too.

This was tested on the tag pci-v5.17-changes. He has
CONFIG_HOTPLUG_PCI_PCIE and CONFIG_DYNAMIC_DEBUG enabled in the config.
He also attached two dmesg logs to bugzilla, both taken with the boot
parameter 'dyndbg="file pciehp* +p"' as requested: one with
"pci=nommconf" and one without.

The issue seems to be related to "pci=nommconf" and is probably caused
by a change outside pciehp.
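
For anyone trying to reproduce this, here is a minimal sketch for
checking whether a machine was booted with "pci=nommconf". It is not
part of the tests described above. The helper name has_nommconf is
made up for illustration, and the sketch assumes /proc/cmdline is
readable and that "pci=" takes a comma-separated option list, as
described in the kernel parameters documentation.

```shell
#!/bin/sh
# has_nommconf (hypothetical helper): succeed if the kernel command line
# passed as $1 contains "nommconf" among the options of a pci= parameter.
# The body runs in a subshell so that "set -f" (disable globbing, needed
# because arguments such as 'pciehp*' contain wildcards) does not leak
# into the caller's shell.
has_nommconf() (
    set -f
    for arg in $1; do
        case "$arg" in
            pci=*)
                # pci= accepts several comma-separated options, e.g.
                # pci=realloc,nommconf; wrap in commas to match whole words.
                case ",${arg#pci=}," in
                    *,nommconf,*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
)

# Check the running kernel, if its command line is available.
if [ -r /proc/cmdline ]; then
    if has_nommconf "$(cat /proc/cmdline)"; then
        echo "pci=nommconf is set"
    else
        echo "pci=nommconf is not set"
    fi
fi
```

Running this on both platforms before a hotplug test should show
whether they differ in this one parameter.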

He is currently setting up his email client so that he can reply himself.

Thanks,
Mariusz




