On Sat, Jan 22, 2022 at 4:38 PM James Turner
<linuxkernel.foss@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Lijo,
>
> > Could you provide the pp_dpm_* values in sysfs with and without the
> > patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> > if it's not in gen3 when the issue happens?
>
> AFAICT, I can't access those values while the AMD GPU PCI devices are
> bound to `vfio-pci`. However, I can at least access the link speed and
> width elsewhere in sysfs. So, I gathered what information I could for
> two different cases:
>
> - With the PCI devices bound to `vfio-pci`. With this configuration, I
>   can start the VM, but the `pp_dpm_*` values are not available since
>   the devices are bound to `vfio-pci` instead of `amdgpu`.
>
> - Without the PCI devices bound to `vfio-pci` (i.e. after removing the
>   `vfio-pci.ids=...` kernel command line argument). With this
>   configuration, I can access the `pp_dpm_*` values, since the PCI
>   devices are bound to `amdgpu`. However, I cannot use the VM. If I try
>   to start the VM, the display (both the external monitors attached to
>   the AMD GPU and the built-in laptop display attached to the Intel
>   iGPU) completely freezes.
>
> The output shown below was identical for both the good commit:
> f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
> and the commit which introduced the issue:
> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
>
> Note that the PCI link speed increased to 8.0 GT/s when the GPU was
> under heavy load for both versions, but the clock speeds of the GPU were
> different under load. (For the good commit, it was 1295 MHz; for the bad
> commit, it was 501 MHz.)
>

Are the ATIF and ATCS ACPI methods available in the guest VM? They
are required for this platform to work correctly from a power
standpoint. One thing that f9b7f3703ff9 did was to get those ACPI
methods executed on certain platforms where they had not been
previously due to a bug in the original implementation. If the
Windows driver doesn't interact with them, it could cause performance
issues. It may have worked by accident before because the ACPI
interfaces may not have been called, leading the Windows driver to
believe this was a standalone dGPU rather than one integrated into a
power/thermal limited platform.

Alex

>
> # With the PCI devices bound to `vfio-pci`
>
> ## Before starting the VM
>
> % ls /sys/module/amdgpu/drivers/pci:amdgpu
> module bind new_id remove_id uevent unbind
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
> ## While running the VM, before placing the AMD GPU under heavy load
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
> ## While running the VM, with the AMD GPU under heavy load
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
> ## While running the VM, after stopping the heavy load on the AMD GPU
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
> ## After stopping the VM
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
>
> # Without the PCI devices bound to `vfio-pci`
>
> % ls /sys/module/amdgpu/drivers/pci:amdgpu
> 0000:01:00.0 module bind new_id remove_id uevent unbind
>
> % for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/pp_dpm_*; do echo "$f"; cat "$f"; echo; done
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_mclk
> 0: 300Mhz
> 1: 625Mhz
> 2: 1500Mhz *
>
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_pcie
> 0: 2.5GT/s, x8
> 1: 8.0GT/s, x16 *
>
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
> 0: 214Mhz
> 1: 501Mhz
> 2: 850Mhz
> 3: 1034Mhz
> 4: 1144Mhz
> 5: 1228Mhz
> 6: 1275Mhz
> 7: 1295Mhz *
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
>
> James
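
For anyone repeating the host-side comparison quoted above, here is a minimal sketch of a helper that prints only the active level (the line marked with `*`) from each `pp_dpm_*` file. The function names `active_dpm_level` and `show_active_levels` are hypothetical, not part of any amdgpu tooling, and the default path assumes the device is bound to `amdgpu` at 0000:01:00.0 as in the report; it does not work while the device is bound to `vfio-pci`.

```shell
#!/bin/sh
# Hypothetical helpers for reading amdgpu pp_dpm_* files; the names are
# made up for illustration.

# Read pp_dpm_* text on stdin and print the level marked active with "*".
active_dpm_level() {
    grep '\*[[:space:]]*$'
}

# Print "<file name>: <active level>" for every pp_dpm_* file in a device
# directory. The default directory is the one from the report above and
# is an assumption; pass another directory as the first argument.
show_active_levels() {
    dev_dir="${1:-/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0}"
    for f in "$dev_dir"/pp_dpm_*; do
        # Skip the unexpanded glob or unreadable files.
        [ -r "$f" ] || continue
        printf '%s: %s\n' "${f##*/}" "$(active_dpm_level < "$f")"
    done
}
```

Calling `show_active_levels` once while idle and once under load, on each kernel being compared, gives one line per file and is easier to diff than the full level tables.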