Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting for once, to make this easily accessible to everyone. Gwan-gyeong Mun, was this regression ever addressed? Doesn't look like it from here, but I might have missed something. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. #regzbot poke On 07.07.23 16:16, Gwan-gyeong Mun wrote: > > > On 7/6/23 1:01 AM, Bjorn Helgaas wrote: >> On Mon, Jul 03, 2023 at 01:37:42PM +0300, Gwan-gyeong Mun wrote: >>> Since Linux 6.2 kernel (same happens in Linux 6.4.1), loading vfio-pci >>> driver to a specific HW (Intel DG2 A770) target does not work properly. >>> (It works fine on Linux 6.1 kernel with the same HW). >> >> Thank you very much for the report! >> >> Does this problem only happen with vfio-pci? d8d2b65a940b appeared in >> v6.2-rc1 (Dec 25, 2022), so I would think somebody would have used DG2 >> on a v6.2 or newer kernel. >> > Hi Bjorn, > > Yes, the problem only occurred when I set DG2 to vfio-pci as shown below > in the settings [1]. > (The reason for setting DG2 to vfio-pci is to use dg2 as a qemu pci > paththru device). > If you don't set DG2 to vfio-pci, you won't see any logs of the problem. > > >> Can you please collect the complete "sudo lspci -vv" output (not just >> the DG2 items)? We need info about the switch ports and all the >> capabilities, since d8d2b65a940b has to do with switch ports, AER, and >> MSI. >> >> Also, please collect the complete dmesg log with v6.4.1 (which does >> not work) and v6.4.1 with d8d2b65a940b reverted (which should work). >> > > I've filed this issue with kernel bugzilla[2] and added the dmesg and > lspci information you asked about as attachments. > I've also added direct links to the relevant logs below. > > 1. complete dmesg log with v6.4.1 with d8d2b65a940b reverted.[3] > 2. lspci -vv with v6.4.1 with d8d2b65a940b reverted [4] > 3. complete dmesg log with v6.4.1 [5] > 4. lspci -vv with v6.4.1 [6] > > [1] > $ cat /etc/modprobe.d/vfio.conf > > options vfio-pci ids=8086:56a0,8086:4f90 > softdep drm pre: vfio-pci > > [2] https://bugzilla.kernel.org/show_bug.cgi?id=217641 > [3] https://bugzilla.kernel.org/attachment.cgi?id=304560 > [4] https://bugzilla.kernel.org/attachment.cgi?id=304561 > [5] https://bugzilla.kernel.org/attachment.cgi?id=304562 > [6] https://bugzilla.kernel.org/attachment.cgi?id=304563 > > >> I know you said that on v6.4.1 with d8d2b65a940b reverted, the system >> boots but there's still a problem with suspend. I'm intentionally >> ignoring this problem for now. After we figure out the boot-time >> problem with the DG2 being left in D3cold, we can come back to the >> suspend problem. > Yes, I understand, and I agree. > > Br, > > G.G. >> >> Bjorn >> >>> The configuration and hardware information used is described in [1]. >>> >>> Starting with the Linux 6.2 kernel, the following log is output to dmesg >>> when a problem occurs. >>> ... >>> [ 15.049948] pcieport 0000:00:01.0: Data Link Layer Link Active not >>> set in >>> 1000 msec >>> [ 15.050024] pcieport 0000:01:00.0: Unable to change power state from >>> D3cold >>> to D0, device inaccessible >>> [ 15.051067] pcieport 0000:02:01.0: Unable to change power state from >>> D3cold >>> to D0, device inaccessible >>> [ 15.052141] pcieport 0000:02:04.0: Unable to change power state from >>> D3cold >>> to D0, device inaccessible >>> [ 17.286554] vfio-pci 0000:03:00.0: not ready 1023ms after resume; >>> giving up >>> [ 17.286553] vfio-pci 0000:04:00.0: not ready 1023ms after resume; >>> giving up >>> [ 17.286576] vfio-pci 0000:03:00.0: Unable to change power state from >>> D3cold >>> to D0, device inaccessible >>> [ 17.286578] vfio-pci 0000:04:00.0: Unable to change power state from >>> D3cold >>> to D0, device inaccessible >>> ... >>> >>> And if you check the DG2 hardware using the "lspci -nnv" command, you >>> will >>> see that "Flags:" is displayed as "!!! Unknown header type 7f" as shown >>> below. [2] >>> The normal output log looks like [3]. >>> >>> This issue has been occurring since the patch below was applied. [4] >>> >>> d8d2b65a940bb497749d66bdab59b530901d3854 is the first bad commit >>> commit d8d2b65a940bb497749d66bdab59b530901d3854 >>> Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> >>> Date: Fri Dec 9 11:01:00 2022 -0600 >>> >>> PCI/portdrv: Allow AER service only for Root Ports & RCECs >>> >>> >>> Rolling back the [4] patch makes it work on boot with the latest >>> version of >>> the kernel, but the same problem still occurs after "suspend to s2idle". >>> This problem existed even before applying [4]. >>> >>> Suspend has been tested with the following command. >>> $ systemctl suspend -i >>> >>> $ cat /sys/power/mem_sleep >>> [s2idle] deep >>> >>> >>> Here is the log that is issued when testing suspend to s2idle. [5] >>> >>> >>> Br, >>> >>> G.G. >>> >>> >>> [1] Env: >>> >>> NUC: intel-nuc-13-extreme-kit-nuc13rngi7 >>> (https://ark.intel.com/content/www/us/en/ark/products/229784/intel-nuc-13-extreme-kit-nuc13rngi7.html) >>> (MB: Z690, CPU: RPL-S i13700k) >>> >>> PCIE Card: Intel A770 GPU >>> >>> Add boot parameter: intel_iommu=on iommu=pt >>> >>> $ lspci -nn |grep DG2 >>> 03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc >>> A770] >>> [8086:56a0] (rev 08) >>> 04:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller >>> [8086:4f90] >>> >>> >>> $ cat /etc/modprobe.d/vfio.conf >>> >>> options vfio-pci ids=8086:56a0,8086:4f90 >>> softdep drm pre: vfio-pci >>> >>> [2] >>> 03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc >>> A770] >>> [8086:56a0] (rev 08) (prog-if 00 [VGA controller]) >>> Subsystem: Intel Corporation Device [8086:1020] >>> !!! Unknown header type 7f >>> Memory at 93000000 (64-bit, non-prefetchable) [size=16M] >>> Memory at 6000000000 (64-bit, prefetchable) [size=16G] >>> Expansion ROM at 94000000 [disabled] [size=2M] >>> Kernel driver in use: vfio-pci >>> Kernel modules: i915 >>> >>> 04:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller >>> [8086:4f90] >>> Subsystem: Intel Corporation Device [8086:1020] >>> !!! Unknown header type 7f >>> Memory at 94300000 (64-bit, non-prefetchable) [size=16K] >>> Kernel driver in use: vfio-pci >>> Kernel modules: snd_hda_intel >>> >>> >>> [3] >>> 03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc >>> A770] >>> [8086:56a0] (rev 08) (prog-if 00 [VGA controller]) >>> Subsystem: Intel Corporation Device [8086:1020] >>> Flags: bus master, fast devsel, latency 0, IOMMU group 19 >>> Memory at 93000000 (64-bit, non-prefetchable) [size=16M] >>> Memory at 6000000000 (64-bit, prefetchable) [size=16G] >>> Expansion ROM at 94000000 [disabled] [size=2M] >>> Capabilities: <access denied> >>> Kernel driver in use: vfio-pci >>> Kernel modules: i915 >>> >>> 04:00.0 Audio device [0403]: Intel Corporation DG2 Audio Controller >>> [8086:4f90] >>> Subsystem: Intel Corporation Device [8086:1020] >>> Flags: fast devsel, IOMMU group 20 >>> Memory at 94300000 (64-bit, non-prefetchable) [disabled] [size=16K] >>> Capabilities: <access denied> >>> Kernel driver in use: vfio-pci >>> Kernel modules: snd_hda_intel >>> >>> >>> [4] >>> commit d8d2b65a940bb497749d66bdab59b530901d3854 >>> Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> >>> Date: Fri Dec 9 11:01:00 2022 -0600 >>> >>> PCI/portdrv: Allow AER service only for Root Ports & RCECs >>> >>> Previously portdrv allowed the AER service for any device with >>> an AER >>> capability (assuming Linux had control of AER) even though the AER >>> service >>> driver only attaches to Root Port and RCECs. >>> >>> Because get_port_device_capability() included AER for non-RP, >>> non-RCEC >>> devices, we tried to initialize the AER IRQ even though these >>> devices >>> don't generate AER interrupts. >>> >>> Intel DG1 and DG2 discrete graphics cards contain a switch >>> leading to a >>> GPU. The switch supports AER but not MSI, so initializing an >>> AER IRQ >>> failed, and portdrv failed to claim the switch port at all. The >>> GPU >>> itself >>> could be suspended, but the switch could not be put in a >>> low-power state >>> because it had no driver. >>> >>> Don't allow the AER service on non-Root Port, non-Root Complex >>> Event >>> Collector devices. This means we won't enable Bus Mastering if the >>> device >>> doesn't require MSI, the AER service will not appear in sysfs, >>> and the >>> AER >>> service driver will not bind to the device. >>> >>> Link: >>> https://lore.kernel.org/r/20221207084105.84947-1-mika.westerberg@xxxxxxxxxxxxxxx >>> Link: >>> https://lore.kernel.org/r/20221210002922.1749403-1-helgaas@xxxxxxxxxx >>> Based-on-patch-by: Mika Westerberg >>> <mika.westerberg@xxxxxxxxxxxxxxx> >>> Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> >>> Reviewed-by: Kuppuswamy Sathyanarayanan >>> <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> >>> >>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c >>> index a6c4225505d5..8b16e96ec15c 100644 >>> --- a/drivers/pci/pcie/portdrv.c >>> +++ b/drivers/pci/pcie/portdrv.c >>> @@ -232,7 +232,9 @@ static int get_port_device_capability(struct pci_dev >>> *dev) >>> } >>> >>> #ifdef CONFIG_PCIEAER >>> - if (dev->aer_cap && pci_aer_available() && >>> + if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT || >>> + pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) && >>> + dev->aer_cap && pci_aer_available() && >>> (pcie_ports_native || host->native_aer)) >>> services |= PCIE_PORT_SERVICE_AER; >>> #endif >>> >>> >>> [5] >>> [ 71.995824] PM: suspend entry (s2idle) >>> [ 72.000793] Filesystems sync: 0.004 seconds >>> [ 72.153926] Freezing user space processes >>> [ 72.156234] Freezing user space processes completed (elapsed 0.002 >>> seconds) >>> [ 72.156244] OOM killer disabled. >>> [ 72.156246] Freezing remaining freezable tasks >>> [ 72.157616] Freezing remaining freezable tasks completed (elapsed >>> 0.001 >>> seconds) >>> [ 72.157619] printk: Suspending console(s) (use no_console_suspend to >>> debug) >>> [ 73.756457] ACPI: EC: interrupt blocked >>> [ 75.103988] ucsi_acpi USBC000:00: ucsi_handle_connector_change: >>> GET_CONNECTOR_STATUS failed (-5) >>> [ 84.052478] ACPI: EC: interrupt unblocked >>> [ 86.085388] pcieport 0000:00:01.0: Data Link Layer Link Active not >>> set in >>> 1000 msec >>> [ 86.085523] pcieport 0000:01:00.0: Unable to change power state from >>> D3cold to D0, device inaccessible >>> [ 86.086984] pci 0000:02:01.0: Unable to change power state from >>> D3cold to >>> D0, device inaccessible >>> [ 86.087005] pci 0000:02:04.0: Unable to change power state from >>> D3cold to >>> D0, device inaccessible >>> [ 88.335403] vfio-pci 0000:04:00.0: not ready 1023ms after resume; >>> waiting >>> [ 88.335427] vfio-pci 0000:03:00.0: not ready 1023ms after resume; >>> waiting >>> [ 89.375444] vfio-pci 0000:04:00.0: not ready 2047ms after resume; >>> waiting >>> [ 89.375471] vfio-pci 0000:03:00.0: not ready 2047ms after resume; >>> waiting >>> [ 91.615418] vfio-pci 0000:04:00.0: not ready 4095ms after resume; >>> waiting >>> [ 91.615439] vfio-pci 0000:03:00.0: not ready 4095ms after resume; >>> waiting >>> [ 95.882059] vfio-pci 0000:04:00.0: not ready 8191ms after resume; >>> waiting >>> [ 95.882081] vfio-pci 0000:03:00.0: not ready 8191ms after resume; >>> waiting >>> [ 104.202062] vfio-pci 0000:04:00.0: not ready 16383ms after resume; >>> waiting >>> [ 104.202066] vfio-pci 0000:03:00.0: not ready 16383ms after resume; >>> waiting >>> [ 121.482058] vfio-pci 0000:04:00.0: not ready 32767ms after resume; >>> waiting >>> [ 121.482067] vfio-pci 0000:03:00.0: not ready 32767ms after resume; >>> waiting >>> [ 155.615409] vfio-pci 0000:04:00.0: not ready 65535ms after resume; >>> giving >>> up >>> [ 155.615412] vfio-pci 0000:03:00.0: not ready 65535ms after resume; >>> giving >>> up >>> [ 155.633757] i915 0000:00:02.0: [drm] GT0: GuC firmware >>> i915/tgl_guc_70.bin version 70.5.1 >>> [ 155.633761] i915 0000:00:02.0: [drm] GT0: HuC firmware >>> i915/tgl_huc.bin >>> version 7.9.3 >>> [ 155.636177] i915 0000:00:02.0: [drm] GT0: HuC: authenticated! >>> [ 155.636860] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled >>> [ 155.636860] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled >>> [ 155.637228] i915 0000:00:02.0: [drm] GT0: GUC: RC enabled >>> [ 155.661583] nvme nvme0: Shutdown timeout set to 10 seconds >>> [ 155.663188] nvme nvme0: 24/0/0 default/read/poll queues >>> [ 155.674267] iwlwifi 0000:00:14.3: WRT: Invalid buffer destination >>> [ 155.823379] ucsi_acpi USBC000:00: possible UCSI driver bug 1 >>> [ 155.823390] ucsi_acpi USBC000:00: failed to re-enable >>> notifications (-22) >>> [ 155.833326] iwlwifi 0000:00:14.3: WFPM_UMAC_PD_NOTIFICATION: 0x1f >>> [ 155.833358] iwlwifi 0000:00:14.3: WFPM_LMAC2_PD_NOTIFICATION: 0x0 >>> [ 155.833367] iwlwifi 0000:00:14.3: WFPM_AUTH_KEY_0: 0x90 >>> [ 155.833377] iwlwifi 0000:00:14.3: CNVI_SCU_SEQ_DATA_DW9: 0x960 >>> [ 155.942363] ata6: SATA link down (SStatus 4 SControl 300) >>> [ 155.942537] ata5: SATA link down (SStatus 4 SControl 300) >>> [ 156.030241] mei_hdcp >>> 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: >>> bound 0000:00:02.0 (ops i915_hdcp_ops [i915]) >>> [ 156.030830] OOM killer enabled. >>> [ 156.030831] Restarting tasks ... >>> [ 156.030894] mei_pxp >>> 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: >>> bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915]) >>> [ 156.031827] done. >>> [ 156.031837] random: crng reseeded on system resumption >>> [ 156.036058] PM: suspend exit >>> [ 158.962881] wlp0s20f3: authenticate with 4c:ed:fb:a0:7f:6c >>> [ 158.966647] wlp0s20f3: send auth to 4c:ed:fb:a0:7f:6c (try 1/3) >>> [ 159.001337] wlp0s20f3: authenticated >>> [ 159.001858] wlp0s20f3: associate with 4c:ed:fb:a0:7f:6c (try 1/3) >>> [ 159.002882] wlp0s20f3: RX AssocResp from 4c:ed:fb:a0:7f:6c >>> (capab=0x11 >>> status=0 aid=1) >>> [ 159.010807] wlp0s20f3: associated >>> [ 159.159528] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes >>> ready >>> [ 287.875205] vfio-pci 0000:04:00.0: Unable to change power state from >>> D3cold to D0, device inaccessible >>> [ 287.936500] vfio-pci 0000:04:00.0: Unable to change power state from >>> D3cold to D0, device inaccessible >>> [ 289.414087] vfio-pci 0000:03:00.0: Unable to change power state from >>> D3cold to D0, device inaccessible >>> [ 289.475297] vfio-pci 0000:03:00.0: Unable to change power state from >>> D3cold to D0, device inaccessible