[+cc Keith, Sagi, Hannes, Kai-Heng, +bcc silverspring from bugzilla] On Wed, Nov 01, 2023 at 06:45:41AM -0500, Bjorn Helgaas wrote: > On Tue, Oct 31, 2023 at 03:21:20PM +0700, Bagas Sanjaya wrote: > > I notice a regression report on Bugzilla [1]. Quoting from it: > > > > > On Kernel 6.4 rc1 and higher if you put the Steam Deck into > > > suspend then press the power button again it will not wake up. > > > > > > I don't have a clue as to -why- this commit breaks wake from > > > suspend on steam deck, but it does. Bisected to: > > > > > > ``` > > > 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit > > > commit 1ad11eafc63ac16e667853bee4273879226d2d1b > > > Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > > > Date: Tue Mar 7 14:32:43 2023 -0600 > > > > > > nvme-pci: drop redundant pci_enable_pcie_error_reporting() > > > > > > pci_enable_pcie_error_reporting() enables the device to send ERR_* > > > Messages. Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is > > > native"), the PCI core does this for all devices during enumeration, so the > > > driver doesn't need to do it itself. > > > > > > Remove the redundant pci_enable_pcie_error_reporting() call from the > > > driver. Also remove the corresponding pci_disable_pcie_error_reporting() > > > from the driver .remove() path. > > > > > > Note that this only controls ERR_* Messages from the device. An ERR_* > > > Message may cause the Root Port to generate an interrupt, depending on the > > > AER Root Error Command register managed by the AER service driver. > > > > > > Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > > > Reviewed-by: Chaitanya Kulkarni <kch@xxxxxxxxxx> > > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > > > > > > drivers/nvme/host/pci.c | 6 +----- > > > 1 file changed, 1 insertion(+), 5 deletions(-) > > > ``` > > > Reverting that commit by itself on top of 6.5.9 (stable) allows > > > it to wake from suspend properly. > > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=218090 > > Thanks, I requested some dmesg logs and lspci output to help debug > this. silverspring attached lspci output and a dmesg log from v6.8 to the bugzilla and also noted that "pci=noaer" works around the problem. The problem commit is 1ad11eafc63a ("nvme-pci: drop redundant pci_enable_pcie_error_reporting()") (https://git.kernel.org/linus/1ad11eafc63a) 1ad11eafc63a removed pci_disable_pcie_error_reporting() from the nvme_suspend() path, so we now leave the PCIe Device Control error enables set when we didn't before. My theory is that the PCIe link goes down during suspend, which causes an error interrupt, and the interrupt causes a problem on Steam Deck. Maybe there's some BIOS connection. "pci=noaer" would work around this because those error enables would never be set in the first place. I asked reporters to test the debug patches below to disable those error interrupts during suspend. I don't think this would be the *right* fix; if we need to do this, I think it should be done by the PCI core, not by individual drivers. Kai-Heng has been suggesting this for a while for a different scenario. Bjorn commit 60c07557d0cc ("Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"") Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> Date: Fri Mar 29 17:54:30 2024 -0500 Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()" This reverts commit 69b264df8a412820e98867dbab871c6526c5e5aa. diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index ac6293c24976..273f9c6f93dd 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -243,6 +243,18 @@ static int pci_enable_pcie_error_reporting(struct pci_dev *dev) return pcibios_err_to_errno(rc); } +int pci_disable_pcie_error_reporting(struct pci_dev *dev) +{ + int rc; + + if (!pcie_aer_is_native(dev)) + return -EIO; + + rc = pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS); + return pcibios_err_to_errno(rc); +} +EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting); + int pci_aer_clear_nonfatal_status(struct pci_dev *dev) { int aer = dev->aer_cap; diff --git a/include/linux/aer.h b/include/linux/aer.h index 4b97f38f3fcf..425e5e430e65 100644 --- a/include/linux/aer.h +++ b/include/linux/aer.h @@ -40,9 +40,14 @@ struct aer_capability_regs { int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log *log); #if defined(CONFIG_PCIEAER) +int pci_disable_pcie_error_reporting(struct pci_dev *dev); int pci_aer_clear_nonfatal_status(struct pci_dev *dev); int pcie_aer_is_native(struct pci_dev *dev); #else +static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev) +{ + return -EINVAL; +} static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev) { return -EINVAL; commit 8efb88cf23d4 ("nvme-pci: disable error reporting in nvme_dev_disable()") Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> Date: Fri Mar 29 17:52:39 2024 -0500 nvme-pci: disable error reporting in nvme_dev_disable() Debug patch. The PCI core enables error reporting in pci_aer_init() for all devices that advertise AER support. During suspend, nvme_suspend() calls nvme_dev_disable() in several cases. Prior to 1ad11eafc63a, nvme_dev_disable() disabled error reporting. After 1ad11eafc63a, error reporting will remain enabled during suspend. Maybe having error reporting enabled during suspend causes a problem on Steam Deck. "pci=noaer" prevents pci_aer_init() from enabling error reporting, so as long as the BIOS doesn't enable it, error reporting should always be disabled. nvme_suspend nvme_disable_prepare_reset nvme_dev_disable diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 8e0bb9692685..2be838b5d1f6 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -5,6 +5,7 @@ */ #include <linux/acpi.h> +#include <linux/aer.h> #include <linux/async.h> #include <linux/blkdev.h> #include <linux/blk-mq.h> @@ -2603,8 +2604,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) nvme_suspend_io_queues(dev); nvme_suspend_queue(dev, 0); pci_free_irq_vectors(pdev); - if (pci_is_enabled(pdev)) + if (pci_is_enabled(pdev)) { + pci_disable_pcie_error_reporting(pdev); pci_disable_device(pdev); + } nvme_reap_pending_cqes(dev); nvme_cancel_tagset(&dev->ctrl);