Re: [net-next v2] net: wwan: t7xx: reset device if suspend fails

Sergey Ryazanov <ryazanov.s.a@xxxxxxxxx> · Mon, 4 Nov 2024 00:02:45 +0200




Hello Bjorn, Linas,

many thanks for clarifying this tricky topic! Hope this information can 
serve a good starting point for Jinjian to develop a proper solution for 
that case.

On 03.11.2024 03:24, Linas Vepstas wrote:
Top-post reply.

What Bjorn says is exactly right. Just one clarifying remark: When PCI
error recovery was designed, it was envisioned for high-end,
high-availability servers so that they could reset a failing device
without forcing a reboot of the OS.  Concepts like suspend/resume did
not perturb even the unconscious sleep of the engineers. Thus,
stepping through error recovery during suspend would be novel,
untested, and perhaps prone to confusion. The error recovery procedure
is trying to reset the device into a fully-powered-on,
fully-functional and connected state. This might be problematic if the
suspend has already walked half-way through power-down. The fact that
error recovery might run a very long fraction of a second after the
error is detected further complicates things. The fact that error
recovery  usually has the driver fully reinitializing the device,
including reloading the firmware, could be a problem if the firmware
is not in RAM or is sitting on a suspended block device. So it all
could be an adventure.

As to testing: I asked one of the guys in the lab how he did it, and
he said he would brush a metal wire against the PCI pins. I winced.
But I guess it's cleaner than pouring coffee on it. Later we developed
a software tool to artificially inject errors.

On Fri, Nov 1, 2024 at 8:52 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote:
From: Sergey Ryazanov <ryazanov.s.a@xxxxxxxxx>
On 29.10.2024 05:46, Jinjian Song wrote:
From: Sergey Ryazanov <ryazanov.s.a@xxxxxxxxx>
On 22.10.2024 11:43, Jinjian Song wrote:
If driver fails to set the device to suspend, it means that the
device is abnormal. In this case, reset the device to recover
when PCIe device is offline.

Is it a reproducible or a speculative issue? Does the fix
recover modem from a problematic state?

Anyway we need someone more familiar with this hardware (Intel
or MediaTek engineer) to Ack the change to make sure we are not
going to put a system in a more complicated state.

This is a very difficult issue to replicate onece occured and fixed.

The issue occured when driver and device lost the connection. I have
encountered this problem twice so far:
1. During suspend/resume stress test, there was a probabilistic D3L2
time sequence issue with the BIOS, result in PCIe link down, driver
read and write the register of device invalid, so suspend failed.
This issue was eventually fixed in the BIOS and I was able to restore
it through the reset module after reproducing the problem.

2. During idle test, the modem probabilistic hang up, result in PCIe
link down, driver read and write the register of device invalid, so
suspend failed. This issue was eventually fiex in device modem firmware
by adjust a certain power supply voltage, and reset modem as a workround
to restore when the MBIM port command timeout in userspace applycations.

Hardware reset modem to recover was discussed with MTK, and they said
that if we don't want to keep the on-site problem location in case of
suspend failure, we can use the recover solution.
Both the ocurred issues result in the PCIe link issue, driver can't
read and writer the register of WWAN device, so I want to add this
path
to restore, hardware reset modem can recover modem, but using the
pci_channle_offline() as the judgment is my inference.

Thank you for the clarification. Let me summarize what I've understood
from the explanation:
a) there were hardware (firmware) issues,
b) issues already were solved,
c) issues were not directly related to the device suspension procedure,
d) you want to implement a backup plan to make the modem support robust.

If got it right, then I would like to recommend to implement a generic
error handling solution for the PCIe interface. You can check this
document: Documentation/PCI/pci-error-recovery.rst

Yes, got it right.
I want to identify the scenario and then recover by reset device,
otherwise suspend failure will aways prevent the system from suspending
if it occurs.

If a PCIe link goes down here's my understanding of what happens:

   - Writes to the device are silently dropped.

   - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()).

   - If the device is in a slot and pciehp is enabled, a Data Link
     Layer State Changed interrupt will cause pciehp_unconfigure_device()
     to detach the driver and remove the pci_dev.

   - If AER is enabled, a failed access to the device will cause an AER
     interrupt.  If the driver has registered pci_error_handlers, the
     driver callbacks will be called, and based on what the driver
     returns, the PCI core may reset the device.

The pciehp and AER interrupts are *asynchronous* to link down events
and to any driver access to the device, so they may be delayed an
arbitrary amount of time.

Both interrupt paths may lead to the device being marked as "offline".
Obviously this is asynchronous with respect to the driver.

+++ b/drivers/net/wwan/t7xx/t7xx_pci.c
@@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct
pci_dev *pdev)
       iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) +
ENABLE_ASPM_LOWPWR);
       atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
       t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
+    if (pci_channel_offline(pdev)) {
+        dev_err(&pdev->dev, "Device offline, reset to recover\n");
+        t7xx_reset_device(t7xx_dev, PLDR);
+    }

This looks like an unreliable way to detect issues.  It only works if
AER is enabled, and the device is only marked "offline" some arbitrary
length of time *after* a driver access to the device has failed.

You can't reliably detect errors on writes to the device.

You can only reliably detect errors on reads from the device by
looking for PCI_POSSIBLE_ERROR().  Obviously ~0 might be a valid value
to read from some registers, so you need device-specific knowledge to
know whether ~0 is valid or indicates an error.

If AER or DPC are enabled, the driver can be *notified* about read
errors and some write errors via pci_error_handlers, but the
notification is long after the error.

--
Sergey