On Wed, Mar 29, 2023 at 9:23 PM Mario Limonciello <mario.limonciello@xxxxxxx> wrote: > > > On 3/29/23 04:59, Kai-Heng Feng wrote: > > When the power is lost due to ACPI power resources being turned off, the > > driver should reset the GPU so it can work anew. > > > > First, _PR3 support of the hierarchy needs to be found correctly. Since > > the GPU on some discrete GFX cards is behind a PCIe switch, checking the > > _PR3 on downstream port alone is not enough, as the _PR3 can associate > > to the root port above the PCIe switch. > > I think this should be split into two commits: > > * One of them to look at _PR3 further up in hierarchy to fix indication > for BOCO support. Yes, this part can be split up. > > * One to adjust policy for whether to reset IIUC, the GPU only needs to be reset when the power status isn't certain? Assuming power resources in _PR3 are really disabled, GPU is already reset by itself. That means reset shouldn't be necessary for D3cold, am I understanding it correctly? However, this is a desktop plugged with GFX card that has external power, does that assumption still stand? Perform resetting on D3cold can cover this scenario. > > > > Once the _PR3 is found and BOCO support is correctly marked, use that > > information to inform the GPU should be reset. This solves an issue that > > system freeze on a Intel ADL desktop that uses S0ix for sleep and D3cold > > is supported for the GFX slot. > > I'm worried this is still papering over an underlying issue with L0s > handling on ALD + Navi1x/Navi2x. Is it possible to get the ASIC's ASPM parameter under Windows? Knowing the difference can be useful. > > Also, what about runtime suspend? If you unplug the monitor from this > dGPU and interact with it over SSH it should go into runtime suspend. > > Is it working properly for that case now? Thanks for the tip. Runtime resume doesn't work at all: [ 1087.601631] pcieport 0000:00:01.0: power state changed by ACPI to D0 [ 1087.613820] pcieport 0000:00:01.0: restoring config space at offset 0x2c (was 0x43, writing 0x43) [ 1087.613835] pcieport 0000:00:01.0: restoring config space at offset 0x28 (was 0x41, writing 0x41) [ 1087.613841] pcieport 0000:00:01.0: restoring config space at offset 0x24 (was 0xfff10001, writing 0xfff10001) [ 1087.613978] pcieport 0000:00:01.0: PME# disabled [ 1087.613984] pcieport 0000:00:01.0: waiting 100 ms for downstream link, after activation [ 1089.330956] pcieport 0000:01:00.0: not ready 1023ms after resume; giving up [ 1089.373036] pcieport 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible After a short while the whole system froze. So the upstream port of GFX's PCIe switch cannot be powered on again. Kai-Heng > > > > > Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885 > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2458 > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 3 +++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++++- > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++------- > > 3 files changed, 14 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > > index 60b1857f469e..407456ac0e84 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > > @@ -987,6 +987,9 @@ bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) > > if (amdgpu_sriov_vf(adev)) > > return false; > > > > + if (amdgpu_device_supports_boco(adev_to_drm(adev))) > > + return true; > > + > > #if IS_ENABLED(CONFIG_SUSPEND) > > return pm_suspend_target_state != PM_SUSPEND_TO_IDLE; > > #else > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index f5658359ff5c..d56b7a2bafa6 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -2181,7 +2181,12 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev) > > > > if (!(adev->flags & AMD_IS_APU)) { > > parent = pci_upstream_bridge(adev->pdev); > > - adev->has_pr3 = parent ? pci_pr3_present(parent) : false; > > + do { > > + if (pci_pr3_present(parent)) { > > + adev->has_pr3 = true; > > + break; > > + } > > + } while ((parent = pci_upstream_bridge(parent))); > > } > > > > amdgpu_amdkfd_device_probe(adev); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > index ba5def374368..5d81fcac4b0a 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > @@ -2415,10 +2415,11 @@ static int amdgpu_pmops_suspend(struct device *dev) > > struct drm_device *drm_dev = dev_get_drvdata(dev); > > struct amdgpu_device *adev = drm_to_adev(drm_dev); > > > > - if (amdgpu_acpi_is_s0ix_active(adev)) > > - adev->in_s0ix = true; > > - else if (amdgpu_acpi_is_s3_active(adev)) > > + if (amdgpu_acpi_is_s3_active(adev) || > > + amdgpu_device_supports_boco(drm_dev)) > > adev->in_s3 = true; > > + else if (amdgpu_acpi_is_s0ix_active(adev)) > > + adev->in_s0ix = true; > > if (!adev->in_s0ix && !adev->in_s3) > > return 0; > > return amdgpu_device_suspend(drm_dev, true); > > @@ -2449,10 +2450,7 @@ static int amdgpu_pmops_resume(struct device *dev) > > adev->no_hw_access = true; > > > > r = amdgpu_device_resume(drm_dev, true); > > - if (amdgpu_acpi_is_s0ix_active(adev)) > > - adev->in_s0ix = false; > > - else > > - adev->in_s3 = false; > > + adev->in_s0ix = adev->in_s3 = false; > > return r; > > } > >