On 3/2/21 9:34 PM, Williams, Dan J wrote:
[ Add Sathya ]
On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote:
Overwriting the frozen detected status with the result of the link reset
loses the NEED_RESET result that drivers are depending on for error
handling to report the .slot_reset() callback. Retain this status so
that subsequent error handling has the correct flow.
Reported-by: Hinko Kocevar <hinko.kocevar@xxxxxx>
Acked-by: Sean V Kelley <sean.v.kelley@xxxxxxxxx>
Signed-off-by: Keith Busch <kbusch@xxxxxxxxxx>
Just want to report that this fix might be a candidate for -stable.
Agree.
I think it can be merged in both stable and mainline kernels.
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
Recent DPC error recovery testing on v5.10 shows that losing this
status prevents NVME from restarting the queue after error recovery
with a failing signature like:
[ 424.685179] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast error_detected message
[ 424.685185] nvme nvme0: frozen state error detected, reset controller
[ 425.078620] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast resume message
[ 425.078674] pcieport 0000:97:02.0: AER: device recovery successful
...and then later:
[ 751.146277] sysrq: Show Blocked State
[ 751.146431] task:touch state:D stack: 0 pid:11969 ppid: 11413 flags:0x00000000
[ 751.146434] Call Trace:
[ 751.146443] __schedule+0x31b/0x890
[ 751.146445] schedule+0x3c/0xa0
[ 751.146449] blk_queue_enter+0x151/0x220
[ 751.146454] ? finish_wait+0x80/0x80
[ 751.146455] submit_bio_noacct+0x39a/0x410
[ 751.146457] submit_bio+0x42/0x150
[ 751.146460] ? bio_add_page+0x62/0x90
[ 751.146461] ? guard_bio_eod+0x25/0x70
[ 751.146465] submit_bh_wbc+0x16d/0x190
[ 751.146469] ext4_read_bh+0x47/0xa0
[ 751.146472] ext4_read_inode_bitmap+0x3cd/0x5f0
...because NVME was only told to disable, never to reset and resume the
block queue.
I have not tested it, but by code inspection I eventually found this
upstream fix.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer