Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1

Hans de Goede <hdegoede@xxxxxxxxxx> · Tue, 25 Jan 2022 09:58:18 +0100

Hi,

On 1/24/22 22:46, Bjorn Helgaas wrote:
> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
> 
> On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215525
>>
>>             Bug ID: 215525
>>            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>>            Product: Drivers
>>            Version: 2.5
>>     Kernel Version: 5.17.0-rc1 upstream
>>           Hardware: x86-64
>>                 OS: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: PCI
>>           Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
>>           Reporter: blazej.kucman@xxxxxxxxx
>>         Regression: No
>>
>> Created attachment 300308
>>   --> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
>> dmesg
>>
>> While testing on latest upstream
>> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/) we
>> noticed that with the merge commit
>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
>> hotplug and hotunplug of nvme drives stopped working.
>>
>> Rescan PCI does not help.
>> echo "1" > /sys/bus/pci/rescan
>>
>> Issue does not reproduce on a kernel built on an antecedent
>> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
>>
>>
>> During hot-remove device does not disappear, however when we try to do I/O on
>> the disk then there is an I/O error, and the device disappears.
>>
>> Before I/O no logs regarding the disk appeared in the dmesg, only after I/O the
>> entries appeared like below:
>> [  177.943703] nvme nvme5: controller is down; will reset: CSTS=0xffffffff,
>> PCI_STATUS=0xffff
>> [  177.971661] nvme 10000:0b:00.0: can't change power state from D3cold to D0
>> (config space inaccessible)
>> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI INT A
>> [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
>> [  177.992633] nvme nvme5: Removing after probe failure status: -19
>> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
>> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags 0x0
>> phys_seg 1 prio class 0
>>
>>
>> OS: RHEL 8.4 GA
>> Platform: Intel Purley
>>
>> The logs are collected on a non-recent upstream kernel, but a issue also occurs
>> on the newest upstream kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)
> 
> Apparently worked immediately before merging the PCI changes for
> v5.17 and failed immediately after:
> 
>   good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>   bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
> 
> Only three commits touch pciehp:
> 
>   085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>   23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>   a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
> 
> None seems obviously related to me.  Blazej, could you try setting
> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
> enable more debug messages?

Since there are only 3 commits maybe try reverting them 1 by 1 in reverse history order
(so revert latest commit first) ? And see if running a kernel with the reverted commit(s)
fixes things ?

Regards,

Hans