On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
[+cc Lukas, beginning of thread:
https://lore.kernel.org/all/de1b20e5-ded1-0aae-2221-f5d470d91015@xxxxxxxxxx/]
On Fri, Mar 10, 2023 at 02:39:19PM -0800, Tushar Dave wrote:
On 3/9/23 09:53, Bjorn Helgaas wrote:
On Wed, Mar 08, 2023 at 07:34:58PM -0800, Tushar Dave wrote:
On 3/7/23 03:59, Sagi Grimberg wrote:
On 3/2/23 02:09, Tushar Dave wrote:
We are observing the NVMe device being disabled due to a reset failure after
injecting a Malformed TLP. DPC/AER recovery succeeds, but NVMe fails.
I tried this on two different systems and it is 100% reproducible with the
6.2 kernel.
On my system, the Samsung NVMe SSD Controller PM173X is directly behind the
Broadcom PCIe Switch Downstream Port.
The Malformed TLP is injected by changing the Max Payload Size (MPS) of the
PCIe switch to 128B (keeping the NVMe device MPS at 512B).
e.g.
# change MPS of PCIe switch (a9:10.0)
$ setpci -v -s a9:10.0 cap_exp+0x8.w
0000:a9:10.0 (cap 10 @68) @70 = 0857
$ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
0000:a9:10.0 (cap 10 @68) @70 0817
# (DevCtl bits 7:5, the MPS field, go from 010b = 512B to 000b = 128B)
$ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 128 bytes
# run some traffic on nvme (ab:00.0)
$ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
dd: error reading '/dev/nvme0n1': Input/output error
2+0 records in
2+0 records out
8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
# kernel log:
[ 163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
[ 163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
[ 163.049071] pcieport 0000:a9:10.0: DPC: containment event,
status:0x2009 source:0x0000
[ 163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
detected
[ 163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[ 163.078151] pcieport 0000:a9:10.0: device [1000:c030] error
status/mask=00040000/00180000
[ 163.087613] pcieport 0000:a9:10.0: [18] MalfTLP
(First)
[ 163.095281] pcieport 0000:a9:10.0: AER: TLP Header: 60000080
ab0000ff 00000001 d1fd0000
[ 163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
[ 163.112095] nvme nvme0: frozen state error detected, reset controller
[ 163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
(sct 0x3 / sc 0x71)
[ 163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
0x4080700 phys_seg 4 prio class 2
[ 163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[ 163.390895] nvme nvme0: restart after slot reset
[ 163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
(was 0x100, writing 0x1ff)
[ 163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
(was 0x0, writing 0xe0600000)
[ 163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
(was 0x4, writing 0xe0710004)
[ 163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
(was 0x0, writing 0x8)
[ 163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
(was 0x100000, writing 0x100546)
[ 163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
[ 163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
(reading 0xa824144d)
[ 163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
(reading 0x100546)
[ 163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
(reading 0x1080200)
[ 163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
(reading 0x8)
[ 163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
(reading 0xe0710004)
[ 163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
(reading 0x0)
[ 163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
(reading 0x0)
[ 163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
(reading 0x0)
[ 163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
(reading 0x0)
[ 163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
(reading 0x0)
[ 163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
(reading 0x0)
[ 163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
(reading 0xa80a144d)
[ 163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
(reading 0xe0600000)
[ 163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
(reading 0x40)
[ 163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
(reading 0x0)
[ 163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
(reading 0x1ff)
[ 210.089132] block nvme0n1: no usable path - requeuing I/O
[ 223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
[ 223.825236] nvme nvme0: Identify Controller failed (-4)
[ 223.832145] nvme nvme0: Disabling device after reset failure: -5
At this point the device is not going to recover.
Yes, I agree.
I looked a little bit more and found that the nvme reset failure and the
second DPC were both due to nvme_slot_reset() restoring the MPS as part of
pci_restore_state().
AFAICT, after the first DPC event occurs, the nvme device MPS gets changed to
the _default_ value of 128B (likely due to the DPC link retraining). However,
as part of software AER recovery, nvme_slot_reset() restores the device
state, and that brings the nvme device MPS back to 512B. (The MPS of the PCIe
switch a9:10.0 still remains at 128B.)
At this point, when nvme_reset_ctrl() -> nvme_reset_work() tries to enable
the device, a Malformed TLP is generated again; that causes the second DPC
and makes the NVMe controller reset fail as well.
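For context, the .slot_reset() path involved is short; roughly (paraphrasing
drivers/nvme/host/pci.c from v6.2, comments mine, not verbatim):

static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	dev_info(dev->ctrl.device, "restart after slot reset\n");
	/*
	 * Writes back the saved config space, including the PCIe
	 * capability's Device Control register whose bits 7:5 hold
	 * the MPS saved before the error (512B here).
	 */
	pci_restore_state(pdev);
	/*
	 * Re-enables the controller; the first DMA with a payload
	 * larger than the switch's 128B MPS triggers the second
	 * Malformed TLP / DPC event.
	 */
	nvme_reset_ctrl(&dev->ctrl);
	return PCI_ERS_RESULT_RECOVERED;
}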
This sounds like the behavior I expect. IIUC:
- Switch and NVMe MPS are 512B
- NVMe config space saved (including MPS=512B)
- You change Switch MPS to 128B
- NVMe does DMA with payload > 128B
- Switch reports Malformed TLP because TLP is larger than its MPS
- Recovery resets NVMe, which sets MPS to the default of 128B
- nvme_slot_reset() restores NVMe config space (MPS is now 512B)
- Subsequent NVMe DMA with payload > 128B repeats cycle
What do you think *should* be happening here? I don't see a PCI
problem here. If you change MPS on the Switch without coordinating
with NVMe, things aren't going to work. Or am I missing something?
I agree this is expected, but there are instances where I do _not_ see the
issue occurring. That is due to the involvement of pciehp, which adds and
configures the nvme device (it coordinates the MPS with the PCIe switch, and
the new MPS gets saved too, so future tests of this kind won't reproduce the
issue; that part is understood).
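For reference, that MPS coordination happens at enumeration time; a
simplified sketch of pci_configure_mps() from drivers/pci/probe.c (default
pcie_bus_config, error handling and special cases omitted), which is what
emits the "Max Payload Size set to 128 (was 512, max 512)" line further down
in the log:

static void pci_configure_mps(struct pci_dev *dev)
{
	struct pci_dev *bridge = pci_upstream_bridge(dev);
	int mps, p_mps;

	if (!pci_is_pcie(dev) || !bridge || !pci_is_pcie(bridge))
		return;

	mps   = pcie_get_mps(dev);	/* endpoint MPS, e.g. 512 */
	p_mps = pcie_get_mps(bridge);	/* upstream port MPS, e.g. 128 */
	if (mps == p_mps)
		return;

	/* with the default pcie_bus_config, match the upstream port */
	if (pcie_set_mps(dev, p_mps))
		return;

	pci_info(dev, "Max Payload Size set to %d (was %d, max %d)\n",
		 p_mps, mps, 128 << dev->pcie_mpss);
}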
IMO, though, the result of the test should be consistent.
Either pciehp/DPC should take care of device recovery 100% of the time, or
we consider the nvme recovery failure an expected result because the MPS of
the PCIe switch was changed without coordinating with nvme.
What do you think?
In the log below, pciehp obviously is enabled; should I infer that in
the log above, it is not?
pciehp is enabled all the time, in the log above and in the log below.
I do not have an answer yet for why pciehp shows up only in some tests (due
to DPC link down/up) and not in others, as you noticed in both logs.
Generally we've avoided handling a device reset as a remove/add event
because upper layers can't deal well with that. But in the log below
it looks like pciehp *did* treat the DPC containment as a remove/add,
which of course involves configuring the "new" device and its MPS
settings.
Yes, and that puzzled me: why, especially when pciehp reports "Link Down/Up
ignored (recovered by DPC)"? Do we still have a race somewhere? I am not sure.
[ 217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[ 217.071217] nvme nvme0: restart after slot reset
[ 217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up ignored (recovered by DPC)
[ 217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active: lnk_status = 2044
[ 217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
[ 217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:ab:00
[ 217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
[ 217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state, DEVCTL: 0x5957
The .slot_reset() method (nvme_slot_reset()) is called *after* the
device has been reset, and the device is supposed to be ready for the
driver to use it. But here it looks like pciehp thinks ab:00.0 is not
present, so it removes it. Later ab:00.0 is present again, so we
re-enumerate it:
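(For reference, a trimmed view of the recovery hooks in question, struct
pci_error_handlers from include/linux/pci.h; the contract is that
.slot_reset() is called only after the reset has completed and the device is
expected to be reachable again:)

struct pci_error_handlers {
	/* PCI bus error detected on this device */
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   pci_channel_state_t state);
	/* MMIO has been re-enabled, but not DMA */
	pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev);
	/* PCI slot has been reset */
	pci_ers_result_t (*slot_reset)(struct pci_dev *dev);
	/* Device driver may resume normal operations */
	void (*resume)(struct pci_dev *dev);
	/* ... */
};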
That is correct.
[ 217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
[ 217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
[ 217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status: lnk_status = 2044
[ 217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
What kernel are you testing? 53b54ad074de ("PCI/DPC: Await readiness
of secondary bus after reset") looks like it could be related, but
you'd have to be using v6.3-rc1 or later to get it.
I am on v6.2, but I will give v6.3-rc1 a try and get back.
e.g. [when pciehp takes care of things]
[ 216.608538] pcieport 0000:a9:10.0: pciehp: pending interrupts 0x0108 from
Slot Status
[ 216.639954] pcieport 0000:a5:01.0: EDR: EDR event received
[ 216.640429] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
[ 216.640438] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009
source:0x0000
[ 216.640442] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
[ 216.640452] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected
(Fatal), type=Transaction Layer, (Receiver ID)
[ 216.652549] pcieport 0000:a9:10.0: device [1000:c030] error
status/mask=00040000/00180000
[ 216.661975] pcieport 0000:a9:10.0: [18] MalfTLP (First)
[ 216.669647] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff
00000102 276fe000
[ 216.678890] pcieport 0000:a9:10.0: AER: broadcast error_detected message
[ 216.678898] nvme nvme0: frozen state error detected, reset controller
[ 216.842570] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error (sct
0x3 / sc 0x71)
[ 216.851684] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
0x4080700 phys_seg 4 prio class 2
[ 217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[ 217.071217] nvme nvme0: restart after slot reset
[ 217.071228] nvme 0000:ab:00.0: nvme_slot_reset: before pci_restore_state
DEVCTL: 0x2910
[ 217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up
ignored (recovered by DPC)
[ 217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
lnk_status = 2044
[ 217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
[ 217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device:
domain:bus:dev = 0000:ab:00
[ 217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was
0x100, writing 0x1ff)
[ 217.071346] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was
0x0, writing 0xe0600000)
[ 217.071373] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was
0x4, writing 0xe0710004)
[ 217.071383] nvme 0000:ab:00.0: restoring config space at offset 0xc (was
0x0, writing 0x8)
[ 217.071394] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was
0x100000, writing 0x100546)
[ 217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state,
DEVCTL: 0x5957
[ 217.071464] pcieport 0000:a9:10.0: AER: broadcast resume message
[ 217.071467] nvme 0000:ab:00.0: PME# disabled
[ 217.071513] pcieport 0000:a9:10.0: AER: device recovery successful
[ 217.071522] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
[ 217.071526] nvme 0000:ab:00.0: vgaarb: pci_notify
[ 217.071531] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
[ 217.071614] nvme nvme0: ctrl state 6 is not RESETTING
[ 217.103486] Buffer I/O error on dev nvme0n1, logical block 2, async page read
[ 217.308778] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.308831] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.311299] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.311863] pci 0000:ab:00.0: device released
[ 217.311887] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
lnk_status = 2044
[ 217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
[ 217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
[ 217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status:
lnk_status = 2044
[ 217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
[ 217.455275] pci 0000:ab:00.0: reg 0x10: [mem 0xe0710000-0xe0717fff 64bit]
[ 217.455362] pci 0000:ab:00.0: reg 0x30: [mem 0xe0600000-0xe060ffff pref]
[ 217.455380] pci 0000:ab:00.0: Max Payload Size set to 128 (was 512, max 512)
[ 217.455726] pci 0000:ab:00.0: reg 0x20c: [mem 0xe0610000-0xe0617fff 64bit]
[ 217.455732] pci 0000:ab:00.0: VF(n) BAR0 space: [mem
0xe0610000-0xe070ffff 64bit] (contains BAR0 for 32 VFs)
[ 217.456307] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.456404] pcieport 0000:a9:10.0: bridge window [io 0x1000-0x0fff] to
[bus ab] add_size 1000
[ 217.456413] pcieport 0000:a9:10.0: bridge window [mem
0x00100000-0x000fffff 64bit pref] to [bus ab] add_size 200000 add_align
100000
[ 217.456430] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
0x00200000 64bit pref]
[ 217.456436] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
0x00200000 64bit pref]
[ 217.456440] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
[ 217.456444] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
[ 217.456451] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
0x00200000 64bit pref]
[ 217.456457] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
0x00200000 64bit pref]
[ 217.456464] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
[ 217.456470] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
[ 217.456480] pci 0000:ab:00.0: BAR 6: assigned [mem 0xe0600000-0xe060ffff pref]
[ 217.456488] pci 0000:ab:00.0: BAR 0: assigned [mem 0xe0610000-0xe0617fff 64bit]
[ 217.456509] pci 0000:ab:00.0: BAR 7: assigned [mem 0xe0618000-0xe0717fff 64bit]
[ 217.456517] pcieport 0000:a9:10.0: PCI bridge to [bus ab]
[ 217.456526] pcieport 0000:a9:10.0: bridge window [mem 0xe0600000-0xe07fffff]
[ 217.456614] nvme 0000:ab:00.0: vgaarb: pci_notify
[ 217.456624] nvme 0000:ab:00.0: runtime IRQ mapping not provided by arch
[ 217.457452] nvme nvme10: pci function 0000:ab:00.0
[ 217.458154] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading
0xa824144d)
[ 217.458166] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading
0x100546)
[ 217.458173] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading
0x1080200)
[ 217.458179] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
[ 217.458185] nvme 0000:ab:00.0: saving config space at offset 0x10
(reading 0xe0610004)
[ 217.458192] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
[ 217.458198] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
[ 217.458202] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
[ 217.458207] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
[ 217.458212] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
[ 217.458217] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
[ 217.458222] nvme 0000:ab:00.0: saving config space at offset 0x2c
(reading 0xa80a144d)
[ 217.458227] nvme 0000:ab:00.0: saving config space at offset 0x30
(reading 0xe0600000)
[ 217.458237] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
[ 217.458242] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
[ 217.458247] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
[ 217.462192] nvme nvme10: Shutdown timeout set to 10 seconds
[ 217.520625] nvme nvme10: 63/0/0 default/read/poll queues