[ Add Sathya ] On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote: > Overwriting the frozen detected status with the result of the link reset > loses the NEED_RESET result that drivers are depending on for error > handling to report the .slot_reset() callback. Retain this status so > that subsequent error handling has the correct flow. > > Reported-by: Hinko Kocevar <hinko.kocevar@xxxxxx> > Acked-by: Sean V Kelley <sean.v.kelley@xxxxxxxxx> > Signed-off-by: Keith Busch <kbusch@xxxxxxxxxx> Just want to report that this fix might be a candidate for -stable. Recent DPC error recovery testing on v5.10 shows that losing this status prevents NVME from restarting the queue after error recovery with a failing signature like: [ 424.685179] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast error_detected message [ 424.685185] nvme nvme0: frozen state error detected, reset controller [ 425.078620] pcie_do_recovery: pcieport 0000:97:02.0: AER: broadcast resume message [ 425.078674] pcieport 0000:97:02.0: AER: device recovery successful ...and then later: [ 751.146277] sysrq: Show Blocked State [ 751.146431] task:touch state:D stack: 0 pid:11969 ppid: 11413 flags:0x00000000 [ 751.146434] Call Trace: [ 751.146443] __schedule+0x31b/0x890 [ 751.146445] schedule+0x3c/0xa0 [ 751.146449] blk_queue_enter+0x151/0x220 [ 751.146454] ? finish_wait+0x80/0x80 [ 751.146455] submit_bio_noacct+0x39a/0x410 [ 751.146457] submit_bio+0x42/0x150 [ 751.146460] ? bio_add_page+0x62/0x90 [ 751.146461] ? guard_bio_eod+0x25/0x70 [ 751.146465] submit_bh_wbc+0x16d/0x190 [ 751.146469] ext4_read_bh+0x47/0xa0 [ 751.146472] ext4_read_inode_bitmap+0x3cd/0x5f0 ...because NVME was only told to disable, never to reset and resume the block queue. I have not tested it, but by code inspection I eventually found this upstream fix.