>On 12/11/2024 15:41, Alex Deucher wrote: >> On Wed, Dec 11, 2024 at 3:19 PM Mario Limonciello <superm1@xxxxxxxxxx> wrote: >>> >>> On 12/11/2024 14:08, Alex Deucher wrote: >>>> On Wed, Dec 11, 2024 at 10:56 AM Mario Limonciello <superm1@xxxxxxxxxx> wrote: >>>>> >>>>> From: Mario Limonciello <mario.limonciello@xxxxxxx> >>>>> >>>>> If the kernel hasn't been compiled with PCIe hotplug support this >>>>> can lead to problems with dGPUs that use BOCO because they effectively >>>>> drop off the bus. >>>>> >>>>> To prevent issues, disable BOCO support when compiled without PCIe hotplug. >>>>> >>>>> Reported-by: Gabriel Marcano <gabemarcano@xxxxxxxxx> >>>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1707#note_2696862 >>>>> Signed-off-by: Mario Limonciello <mario.limonciello@xxxxxxx> >>>> >>>> Acked-by: Alex Deucher <alexander.deucher@xxxxxxx> >>> >>> Thx. >>> >>>> >>>> Seems like this should affect any device which supports d3cold. Maybe >>>> we want something more general as well? >>> >>> Any specific ideas? One of these two hunks I think make sense, leaning >>> upon the second one more strongly. >> >> Actually, I wonder if the affected hardware pre-dates d3cold and uses >> the old proprietary AMD ATPX interface to control dGPU power. In that >> case, the d3cold is managed by the driver rather than the PCI/ACPI >> subsystems. IIRC, there was a workaround in the PCIe hotplug code to >> avoid calling the pci remove function when the driver powered down the >> GPU via ATPX (or the nvidia equivalent). If so, this check should go >> in amdgpu_device_supports_px() instead. > >Gabriel, > >Can you please share a full kernel log so we can clarify which method >your hardware uses? > Sure thing. I am attaching a kernel output from last night (it actually crashed what looks to be the renoir APU as I tried to turn off the computer, which shows up in the logs towards the end). Some caveats about my system: - I'm using some modified ACPI tables: - I've tweaked some WMI-related WMAX code (read/write GPIO for RGB controller) - I've fixed a missing symbol issue (renamed _EC0 to __EC) - Fixed a bunch of other warnings reported by iasl - I have `#define DEBUG 1` in amdgpu_drv.c - I have a patch from https://bugzilla.kernel.org/show_bug.cgi?id=215884 applied - My kernel is using Gentoo patches Looking at my dmesg output, it looks like I'm using ATPX: [ +0.000022] amdgpu: vga_switcheroo: detected switching method \_SB_.PCI0.GP17.VGA_.ATPX handle [ +0.001561] amdgpu: ATPX version 1, functions 0x00000001 [ +0.000120] amdgpu: ATPX Hybrid Graphics Also I see this in my ACPI table dissasembly: Scope (\_SB.PCI0.GP17.VGA) { Name (M189, Buffer (0x0100){}) Name (M190, Ones) Name (M191, Ones) Method (ATPX, 2, Serialized) { If you need me to recompile the kernel and/or disable my changes to my ACPI tables, let me know. Thanks, Gabriel >Thanks, >> >> Alex >> >>> >>> >>> >>> diff --git a/drivers/pci/pci.c >>> b/drivers/pci/pci.c >>> index 0b29ec6e8e5e2..01691f1d26fbe 100644 >>> --- a/drivers/pci/pci.c >>> +++ b/drivers/pci/pci.c >>> @@ -2751,6 +2751,10 @@ int pci_prepare_to_sleep(struct pci_dev *dev) >>> if (target_state == PCI_POWER_ERROR) >>> return -EIO; >>> >>> + if (!IS_ENABLED(CONFIG_HOTPLUG_PCI_PCIE) && >>> + target_state == PCI_D3cold) >>> + return -EBUSY; >>> + >>> pci_enable_wake(dev, target_state, wakeup); >>> >>> error = pci_set_power_state(dev, target_state); >>> @@ -2797,6 +2801,10 @@ int pci_finish_runtime_suspend(struct pci_dev *dev) >>> if (target_state == PCI_POWER_ERROR) >>> return -EIO; >>> >>> + if (!IS_ENABLED(CONFIG_HOTPLUG_PCI_PCIE) && >>> + target_state == PCI_D3cold) >>> + return -EBUSY; >>> + >>> __pci_enable_wake(dev, target_state, pci_dev_run_wake(dev)); >>> >>> error = pci_set_power_state(dev, target_state); >>>> >>>> Alex >>>> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ >>>>> 1 file changed, 3 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> index 764d41434a82f..7db796ebb967e 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> @@ -419,6 +419,9 @@ bool amdgpu_device_supports_boco(struct drm_device *dev) >>>>> { >>>>> struct amdgpu_device *adev = drm_to_adev(dev); >>>>> >>>>> + if (!IS_ENABLED(CONFIG_HOTPLUG_PCI_PCIE)) >>>>> + return false; >>>>> + >>>>> if (adev->has_pr3 || >>>>> ((adev->flags & AMD_IS_PX) && amdgpu_is_atpx_hybrid())) >>>>> return true; >>>>> -- >>>>> 2.43.0 >>>>> >>>
Attachment:
dmesg_with_flip_done
Description: Binary data