Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset

Nilay Shroff <nilay@xxxxxxxxxxxxx> · Wed, 6 Mar 2024 16:50:10 +0530

Hi Keith and Christoph,

On 2/27/24 23:59, Keith Busch wrote:
> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>> If the nvme subsystem reset causes the loss of communication to the nvme
>> adapter then EEH could potentially recover the adapter. The detection of
>> communication loss to the adapter only happens when the nvme driver
>> attempts to read an MMIO register.
>>
>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>> schedules an adapter reset. If the nvme subsystem reset causes the loss
>> of communication to the nvme adapter, then either the IO timeout event or
>> the adapter reset handler could detect it. If the IO timeout event detects
>> the loss of communication, then the EEH handler is able to recover the
>> communication to the adapter. This change was implemented in 651438bb0af5
>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>> loss is detected in the nvme reset work handler, then EEH is unable to
>> successfully finish the adapter recovery.
>>
>> This patch ensures that,
>> - the nvme driver reset handler observes that the pci channel is offline
>>   after a failed MMIO read and avoids marking the controller state as DEAD,
>>   and thus gives the EEH handler a fair chance to recover the nvme adapter.
>>
>> - if the nvme controller is already in RESETTING state and a pci channel
>>   frozen error is detected, then the nvme driver pci-error-handler code sends
>>   the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>>   so that the EEH handler can proceed with the pci slot reset.
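
As an aside on the value mentioned above: 0x4E564D65 is the ASCII string
"NVMe" packed most significant byte first. A small sketch to show this; the
macro and helper names below are hypothetical and are not kernel identifiers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted above as being written to the NSSR register. */
#define SKETCH_NSSR_MAGIC 0x4E564D65u

/*
 * Hypothetical helper: unpack a 32-bit value, most significant byte
 * first, into a NUL-terminated 4-character string.
 */
void sketch_magic_to_ascii(uint32_t value, char out[5])
{
	assert(out != NULL);
	for (int i = 0; i < 4; i++)
		out[i] = (char)((value >> (24 - 8 * i)) & 0xff);
	out[4] = '\0';
}
```

Calling sketch_magic_to_ascii(SKETCH_NSSR_MAGIC, buf) with a 5-byte buffer
leaves the characters 'N', 'V', 'M', 'e' in buf.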
> 
> A subsystem reset takes the link down. I'm pretty sure the proper way to
> recover from it requires pcie hotplug support.

This was working earlier in kernel version 6.0.0. We were able to recover the
NVMe PCIe adapter on powerpc after an nvme subsystem reset, provided some IO
was in flight when the subsystem reset happened. However, starting with kernel
version 6.1.0 this is broken.

I've found the offending commit, 1e866afd4bcd (nvme: ensure subsystem reset is
single threaded), which causes this issue on kernel version 6.1.0 and above.
So this seems to be a regression, and the proposed patch helps fix it.
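
To make the two behaviours from the patch description concrete, here is a
minimal, self-contained C sketch of the decisions involved. It is only a model
under stated assumptions: the enum names and both functions are hypothetical
stand-ins, not the actual nvme-pci code, which works with the kernel's
pci_channel_state_t, pci_ers_result_t and the nvme controller state machine.
The results returned for the normal and permanent-failure channel states are
my assumption about typical handler behaviour and are not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel types; names are illustrative only. */
enum sketch_channel { SKETCH_IO_NORMAL, SKETCH_IO_FROZEN, SKETCH_IO_PERM_FAILURE };
enum sketch_ctrl    { SKETCH_CTRL_LIVE, SKETCH_CTRL_RESETTING, SKETCH_CTRL_DEAD };
enum sketch_ers     { SKETCH_ERS_CAN_RECOVER, SKETCH_ERS_NEED_RESET, SKETCH_ERS_DISCONNECT };

/*
 * Model of the error_detected() decision: a frozen channel always asks
 * EEH for a slot reset, independent of the controller state (including
 * a controller that is already RESETTING).
 */
enum sketch_ers sketch_error_detected(enum sketch_channel ch)
{
	switch (ch) {
	case SKETCH_IO_NORMAL:
		return SKETCH_ERS_CAN_RECOVER;	/* assumption, see lead-in */
	case SKETCH_IO_FROZEN:
		return SKETCH_ERS_NEED_RESET;
	case SKETCH_IO_PERM_FAILURE:
		return SKETCH_ERS_DISCONNECT;	/* assumption, see lead-in */
	}
	return SKETCH_ERS_DISCONNECT;
}

/*
 * Model of the reset handler's failure path: if the channel is offline,
 * leave the controller in RESETTING so EEH can still recover it;
 * otherwise give up and mark it DEAD.
 */
enum sketch_ctrl sketch_after_failed_reset(enum sketch_ctrl cur, bool channel_offline)
{
	assert(cur == SKETCH_CTRL_RESETTING);	/* reset work runs in RESETTING */
	return channel_offline ? cur : SKETCH_CTRL_DEAD;
}
```

In this model, a failed reset with the channel offline keeps the controller in
RESETTING, and a subsequent frozen-channel notification yields "need reset",
which is the sequence the working 6.0.0 log below shows EEH acting on.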

Please find below logs captured for both working and non-working cases:

Working case (kernel version 6.0.0):
-----------------------------------
# uname -r
6.0.0

# nvme list-subsys
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa
\
 +- nvme0 pcie 0018:01:00.0 live

# nvme subsystem-reset /dev/nvme0

# dmesg
<snip>
<snip>

[ 3215.658378] EEH: Recovering PHB#18-PE#10000
[ 3215.658401] EEH: PE location: N/A, PHB location: N/A
[ 3215.658406] EEH: Frozen PHB#18-PE#10000 detected
[ 3215.658409] EEH: Call Trace:
[ 3215.658411] EEH: [c00000000005130c] __eeh_send_failure_event+0x7c/0x160
[ 3215.658577] EEH: [c00000000004a104] eeh_dev_check_failure.part.0+0x254/0x670
[ 3215.658583] EEH: [c0080000044e61bc] nvme_timeout+0x254/0x4f0 [nvme]
[ 3215.658591] EEH: [c00000000078d840] blk_mq_check_expired+0xa0/0x130
[ 3215.658602] EEH: [c00000000079a118] bt_iter+0xf8/0x140
[ 3215.658609] EEH: [c00000000079b29c] blk_mq_queue_tag_busy_iter+0x3cc/0x720
[ 3215.658620] EEH: [c00000000078fe74] blk_mq_timeout_work+0x84/0x1c0
[ 3215.658633] EEH: [c000000000173b08] process_one_work+0x2a8/0x570
[ 3215.658644] EEH: [c000000000173e68] worker_thread+0x98/0x5e0
[ 3215.658655] EEH: [c000000000181504] kthread+0x124/0x130
[ 3215.658666] EEH: [c00000000000cbd4] ret_from_kernel_thread+0x5c/0x64
[ 3215.658672] EEH: This PCI device has failed 5 times in the last hour and will be permanently disabled after 5 failures.
[ 3215.658677] EEH: Notify device drivers to shutdown
[ 3215.658681] EEH: Beginning: 'error_detected(IO frozen)'
[ 3215.658688] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[ 3215.658692] nvme nvme0: frozen state error detected, reset controller
[ 3215.788089] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
[ 3215.788092] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
<snip>
<snip>
[ 3215.790666] EEH: Reset without hotplug activity
[ 3218.078715] EEH: Beginning: 'slot_reset'
[ 3218.078729] PCI 0018:01:00.0#10000: EEH: Invoking nvme->slot_reset()
[ 3218.078734] nvme nvme0: restart after slot reset
[ 3218.081088] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'recovered'
[ 3218.081090] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[ 3218.081099] EEH: Notify device driver to resume
[ 3218.081101] EEH: Beginning: 'resume'
<snip>
[ 3218.161027] EEH: Finished:'resume'
[ 3218.161038] EEH: Recovery successful.

# nvme list-subsys
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa
\
 +- nvme0 pcie 0018:01:00.0 live

Non-working case (kernel version 6.1.0):
---------------------------------------
# uname -r
6.1.0

# nvme list-subsys
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa
\
 +- nvme0 pcie 0018:01:00.0 live

# nvme subsystem-reset /dev/nvme0

# dmesg
[  177.578828] EEH: Recovering PHB#18-PE#10000
[  177.578852] EEH: PE location: N/A, PHB location: N/A
[  177.578858] EEH: Frozen PHB#18-PE#10000 detected
[  177.578869] EEH: Call Trace:
[  177.578872] EEH: [c0000000000510bc] __eeh_send_failure_event+0x7c/0x160
[  177.579206] EEH: [c000000000049eb4] eeh_dev_check_failure.part.0+0x254/0x670
[  177.579212] EEH: [c008000004c261cc] nvme_timeout+0x254/0x4e0 [nvme]
[  177.579221] EEH: [c00000000079cb00] blk_mq_check_expired+0xa0/0x130
[  177.579226] EEH: [c0000000007a9628] bt_iter+0xf8/0x140
[  177.579231] EEH: [c0000000007aa79c] blk_mq_queue_tag_busy_iter+0x3bc/0x6e0
[  177.579237] EEH: [c00000000079f324] blk_mq_timeout_work+0x84/0x1c0
[  177.579241] EEH: [c000000000174a28] process_one_work+0x2a8/0x570
[  177.579247] EEH: [c000000000174d88] worker_thread+0x98/0x5e0
[  177.579253] EEH: [c000000000182454] kthread+0x124/0x130
[  177.579257] EEH: [c00000000000cddc] ret_from_kernel_thread+0x5c/0x64
[  177.579263] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  177.579269] EEH: Notify device drivers to shutdown
[  177.579272] EEH: Beginning: 'error_detected(IO frozen)'
[  177.579276] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[  177.579279] nvme nvme0: frozen state error detected, reset controller
[  177.658746] nvme 0018:01:00.0: enabling device (0000 -> 0002)
[  177.658967] nvme 0018:01:00.0: iommu: 64-bit OK but direct DMA is limited by 800000800000000
[  177.658982] nvme 0018:01:00.0: iommu: 64-bit OK but direct DMA is limited by 800000800000000
[  177.659059] nvme nvme0: Removing after probe failure status: -19
[  177.698719] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
[  177.698723] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
<snip>
<snip>
[  179.999828] EEH: Beginning: 'slot_reset'
[  179.999840] PCI 0018:01:00.0#10000: EEH: no driver
[  179.999842] EEH: Finished:'slot_reset' with aggregate recovery state:'none'
[  179.999848] EEH: Notify device driver to resume
[  179.999850] EEH: Beginning: 'resume'
[  179.999853] PCI 0018:01:00.0#10000: EEH: no driver
<snip>

# nvme list-subsys
<empty>

Thanks,
--Nilay