Hi Bjorn, On Sat, Mar 30, 2024 at 9:47 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > [+cc Keith, Sagi, Hannes, Kai-Heng, +bcc silverspring from bugzilla] > > On Wed, Nov 01, 2023 at 06:45:41AM -0500, Bjorn Helgaas wrote: > > On Tue, Oct 31, 2023 at 03:21:20PM +0700, Bagas Sanjaya wrote: > > > I notice a regression report on Bugzilla [1]. Quoting from it: > > > > > > > On Kernel 6.4 rc1 and higher if you put the Steam Deck into > > > > suspend then press the power button again it will not wake up. > > > > > > > > I don't have a clue as to -why- this commit breaks wake from > > > > suspend on steam deck, but it does. Bisected to: > > > > > > > > ``` > > > > 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit > > > > commit 1ad11eafc63ac16e667853bee4273879226d2d1b > > > > Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > > > > Date: Tue Mar 7 14:32:43 2023 -0600 > > > > > > > > nvme-pci: drop redundant pci_enable_pcie_error_reporting() > > > > > > > > pci_enable_pcie_error_reporting() enables the device to send ERR_* > > > > Messages. Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is > > > > native"), the PCI core does this for all devices during enumeration, so the > > > > driver doesn't need to do it itself. > > > > > > > > Remove the redundant pci_enable_pcie_error_reporting() call from the > > > > driver. Also remove the corresponding pci_disable_pcie_error_reporting() > > > > from the driver .remove() path. > > > > > > > > Note that this only controls ERR_* Messages from the device. An ERR_* > > > > Message may cause the Root Port to generate an interrupt, depending on the > > > > AER Root Error Command register managed by the AER service driver. > > > > > > > > Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > > > > Reviewed-by: Chaitanya Kulkarni <kch@xxxxxxxxxx> > > > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > > > > > > > > drivers/nvme/host/pci.c | 6 +----- > > > > 1 file changed, 1 insertion(+), 5 deletions(-) > > > > ``` > > > > > Reverting that commit by itself on top of 6.5.9 (stable) allows > > > > it to wake from suspend properly. > > > > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=218090 > > > > Thanks, I requested some dmesg logs and lspci output to help debug > > this. > > silverspring attached lspci output and a dmesg log from v6.8 to the > bugzilla and also noted that "pci=noaer" works around the problem. > > The problem commit is 1ad11eafc63a ("nvme-pci: drop redundant > pci_enable_pcie_error_reporting()") > (https://git.kernel.org/linus/1ad11eafc63a) > > 1ad11eafc63a removed pci_disable_pcie_error_reporting() from the > nvme_suspend() path, so we now leave the PCIe Device Control error > enables set when we didn't before. My theory is that the PCIe link > goes down during suspend, which causes an error interrupt, and the > interrupt causes a problem on Steam Deck. Maybe there's some BIOS > connection. > > "pci=noaer" would work around this because those error enables would > never be set in the first place. > > I asked reporters to test the debug patches below to disable those > error interrupts during suspend. > > I don't think this would be the *right* fix; if we need to do this, I > think it should be done by the PCI core, not by individual drivers. > Kai-Heng has been suggesting this for a while for a different > scenario. Should I send the patch to mailing list again to stir more discussion? Kai-Heng > > Bjorn > > > commit 60c07557d0cc ("Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"") > Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > Date: Fri Mar 29 17:54:30 2024 -0500 > > Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()" > > This reverts commit 69b264df8a412820e98867dbab871c6526c5e5aa. > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index ac6293c24976..273f9c6f93dd 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -243,6 +243,18 @@ static int pci_enable_pcie_error_reporting(struct pci_dev *dev) > return pcibios_err_to_errno(rc); > } > > +int pci_disable_pcie_error_reporting(struct pci_dev *dev) > +{ > + int rc; > + > + if (!pcie_aer_is_native(dev)) > + return -EIO; > + > + rc = pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS); > + return pcibios_err_to_errno(rc); > +} > +EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting); > + > int pci_aer_clear_nonfatal_status(struct pci_dev *dev) > { > int aer = dev->aer_cap; > diff --git a/include/linux/aer.h b/include/linux/aer.h > index 4b97f38f3fcf..425e5e430e65 100644 > --- a/include/linux/aer.h > +++ b/include/linux/aer.h > @@ -40,9 +40,14 @@ struct aer_capability_regs { > int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log *log); > > #if defined(CONFIG_PCIEAER) > +int pci_disable_pcie_error_reporting(struct pci_dev *dev); > int pci_aer_clear_nonfatal_status(struct pci_dev *dev); > int pcie_aer_is_native(struct pci_dev *dev); > #else > +static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev) > +{ > + return -EINVAL; > +} > static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev) > { > return -EINVAL; > commit 8efb88cf23d4 ("nvme-pci: disable error reporting in nvme_dev_disable()") > Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > Date: Fri Mar 29 17:52:39 2024 -0500 > > nvme-pci: disable error reporting in nvme_dev_disable() > > Debug patch. > > The PCI core enables error reporting in pci_aer_init() for all devices that > advertise AER support. > > During suspend, nvme_suspend() calls nvme_dev_disable() in several cases. > Prior to 1ad11eafc63a, nvme_dev_disable() disabled error reporting. > > After 1ad11eafc63a, error reporting will remain enabled during suspend. > Maybe having error reporting enabled during suspend causes a problem on > Steam Deck. > > "pci=noaer" prevents pci_aer_init() from enabling error reporting, so as > long as the BIOS doesn't enable it, error reporting should always be > disabled. > > nvme_suspend > nvme_disable_prepare_reset > nvme_dev_disable > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 8e0bb9692685..2be838b5d1f6 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -5,6 +5,7 @@ > */ > > #include <linux/acpi.h> > +#include <linux/aer.h> > #include <linux/async.h> > #include <linux/blkdev.h> > #include <linux/blk-mq.h> > @@ -2603,8 +2604,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) > nvme_suspend_io_queues(dev); > nvme_suspend_queue(dev, 0); > pci_free_irq_vectors(pdev); > - if (pci_is_enabled(pdev)) > + if (pci_is_enabled(pdev)) { > + pci_disable_pcie_error_reporting(pdev); > pci_disable_device(pdev); > + } > nvme_reap_pending_cqes(dev); > > nvme_cancel_tagset(&dev->ctrl);