Patch is: Reviewed-by: Alex Deucher <alexander.deucher@xxxxxxx> On Thu, Oct 27, 2022 at 9:39 AM Alex Deucher <alexdeucher@xxxxxxxxx> wrote: > > On Thu, Oct 27, 2022 at 2:31 AM Christian König > <christian.koenig@xxxxxxx> wrote: > > > > Am 26.10.22 um 21:03 schrieb Mario Limonciello: > > > If a system does not have swap and memory is under 100% usage, > > > amdgpu will fail to evict resources. Currently the suspend > > > carries on proceeding to reset the GPU: > > > > > > ``` > > > [drm] evicting device resources failed > > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12 > > > [drm] free PSP TMR buffer > > > [TTM] Failed allocating page table > > > [drm] evicting device resources failed > > > amdgpu 0000:03:00.0: amdgpu: MODE1 reset > > > amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset > > > amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset > > > ``` > > > > > > At this point if the suspend actually succeeded I think that amdgpu > > > would have recovered because the GPU would have power cut off and > > > restored. However the kernel fails to continue the suspend from the > > > memory pressure and amdgpu fails to run the "resume" from the aborted > > > suspend. > > > > > > ``` > > > ACPI: PM: Preparing to enter system sleep state S3 > > > SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO) > > > cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0 > > > node 0: slabs: 22, objs: 1122, free: 0 > > > ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651) > > > > > > [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed! > > > [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed > > > [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62 > > > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62). > > > PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62 > > > amdgpu 0000:03:00.0: PM: failed to resume async: error -62 > > > ``` > > > > > > To avoid this series of unfortunate events, fail amdgpu's suspend > > > when the memory eviction fails. This will let the system gracefully > > > recover and the user can try suspend again when the memory pressure > > > is relieved. > > > > Yeah, I've been thinking about that handling for a while now as well. > > > > Failing to suspend when we are OOM is certainly the right thing to do > > from a technical perspective. > > > > But it also means that when users close their laptop it can happen that > > it keeps running and draining the battery. > > > > On the other hand when you don't have swap configured it's your fault > > and not the drivers. > > > > It's a trade off and I'm not sure what's better. Alex any comment here? > > Probably better handled in userspace as Mario noted. Also, at least > the whole driver won't fall apart due to important buffers getting > lost. > > Alex > > > > > > Thanks, > > Christian. > > > > > > > > Reported-by: post@xxxxxxxxxx > > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223 > > > Signed-off-by: Mario Limonciello <mario.limonciello@xxxxxxx> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++----- > > > 1 file changed, 10 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > index 6f958603c8cc2..ae10acede495e 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev) > > > * at suspend time. > > > * > > > */ > > > -static void amdgpu_device_evict_resources(struct amdgpu_device *adev) > > > +static int amdgpu_device_evict_resources(struct amdgpu_device *adev) > > > { > > > + int ret; > > > + > > > /* No need to evict vram on APUs for suspend to ram or s2idle */ > > > if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU)) > > > - return; > > > + return 0; > > > > > > - if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM)) > > > + ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM); > > > + if (ret) > > > DRM_WARN("evicting device resources failed\n"); > > > - > > > + return ret; > > > } > > > > > > /* > > > @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon) > > > if (!adev->in_s0ix) > > > amdgpu_amdkfd_suspend(adev, adev->in_runpm); > > > > > > - amdgpu_device_evict_resources(adev); > > > + r = amdgpu_device_evict_resources(adev); > > > + if (r) > > > + return r; > > > > > > amdgpu_fence_driver_hw_fini(adev); > > > > >