Re: [PATCH v2 0/3] Fix DCN 3.1.4 hangs on s2idle entry

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Wed, 17 May 2023 10:56:09 +0530

On 5/17/2023 10:46 AM, Limonciello, Mario wrote:

On 5/17/2023 12:07 AM, Lazar, Lijo wrote:

On 5/17/2023 10:25 AM, Limonciello, Mario wrote:

On 5/16/2023 11:43 PM, Lazar, Lijo wrote:

On 5/17/2023 5:04 AM, Mario Limonciello wrote:
DCN 3.1.4 s2idle entry will hang
occasionally on s2idle entry, but only if running Wayland and only
when using `systemctl suspend`, not `echo mem | tee /sys/power/state`.

This happens because using `systemctl suspend` will cause the screen
to lock right before writing mem into /sys/power/state.

A couple of things on this since this mentions systemctl suspend -

1) If in s2idle, it's supposed to immediately signal and not 
schedule delayed work

3964b0c2e843334858da99db881859faa4df241d drm/amdgpu: complete gfxoff 
allow signal during suspend without delay

It looks like dead code to me now actually.

amdgpu_device_set_pg_state() skips GFX, so gfxoff control never gets 
called as part of suspend path.

Ok, that means schedule happened sometime before. 
To come up with these patches I had a test kernel with extra prints that 
showed the function call orders.

With systemctl suspend there is a call to 
gfx_v11_0_get_gpu_clock_counter() from userspace IOCTL that triggers all 
this behavior. 

I think we replaced this with golden timestamp value which doesn't 
require GFX register access.

 Here is the function calls with the patched kernel:

[   32.720456] amdgpu 0000:c2:00.0: amdgpu: set GFX off state to 
enabled, count:1
[   32.720457] amdgpu 0000:c2:00.0: amdgpu: broke gfx_off_mutex for 
gfx_v11_0_get_gpu_clock_counter+0xa8/0xf0 [amdgpu], 
adev->gfx.gfx_off_state is 0
[   32.760475] PM: suspend entry (s2idle)
[   32.768996] Filesystems sync: 0.008 seconds
[   32.769310] Freezing user space processes
[   32.776527] Freezing user space processes completed (elapsed 0.007 
seconds)
[   32.776530] OOM killer disabled.
[   32.776531] Freezing remaining freezable tasks
[   32.777528] Freezing remaining freezable tasks completed (elapsed 
0.000 seconds)
[   32.777531] printk: Suspending console(s) (use no_console_suspend to 
debug)
[   32.817853] amdgpu 0000:c2:00.0: amdgpu: Delayed work to enable gfxoff
[   32.817857] amdgpu 0000:c2:00.0: amdgpu: 
amdgpu_dpm_set_powergating_by_smu by 
amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
[   32.818142] amdgpu 0000:c2:00.0: amdgpu: broke pm.mutex for 
amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
[   32.852099] amdgpu 0000:c2:00.0: amdgpu: smu_suspend: suspend called
[   32.852101] amdgpu 0000:c2:00.0: amdgpu: smu_disable_dpms: called

Without patch 1 the delayed work doesn't get called on entry ever.

Can we remove this code also as there is a flush anyway with patch 1?

Sure.  Do you think it should go into patch 1 or on it's own?

Preferably in patch 1 itself as it explains why it was removed.

Also, is there a need to call GFXOFF forcefully on S0ix suspend (any 
chance that gfxoff is not scheduled)?

If using "echo mem | sudo tee /sys/power/state" I've confirmed that it's 
already in GFXOFF.  I don't think this case should happen.
2) RLC is never stopped on GFX 10 or greater.

System was hanging before this series.

Patch 3 "alone" matches this behavior as described above to skip RLC 
suspend but two problems happen:

1) GFXOFF workqueue doesn't get flushed and so driver's request for 
GFXOFF can happen at wrong time.

2) If suspend entry happens before GFXOFF is really asserted lots of 
errors on resume. IE:

Is patch 3 really required?  Does it make any difference?

No; patch 3 isn't really required with patches 1 and 2.

My preference is to drop patch 3 and not to have an additional place of 
in_s0ix check.

Thanks,
Lijo

Thanks,
Lijo

[   63.095227] [drm] Fence fallback timer expired on ring sdma0
[   63.098360] [drm] ring gfx_32772.1.1 was added
[   63.099439] [drm] ring compute_32772.2.2 was added
[   63.100460] [drm] ring sdma_32772.3.3 was added
[   63.100504] [drm] ring gfx_32772.1.1 test pass
[   63.607166] [drm] Fence fallback timer expired on ring gfx_32772.1.1
[   63.607234] [drm] ring gfx_32772.1.1 ib test pass
[   63.608964] [drm] ring compute_32772.2.2 test pass
[   64.119173] [drm] Fence fallback timer expired on ring 
compute_32772.2.2
[   64.119219] [drm] ring compute_32772.2.2 ib test pass
[   64.121364] [drm] ring sdma_32772.3.3 test pass
[   64.631422] [drm] Fence fallback timer expired on ring sdma_32772.3.3
[   64.631465] [drm] ring sdma_32772.3.3 ib test pass
[   65.143184] [drm] Fence fallback timer expired on ring sdma0

Wondering if the code hides something else because of the timing.
Thanks,
Lijo

This causes a delayed GFXOFF entry to be scheduled right before s2idle
entry.  If the workqueue doesn't get processed before the RLC is 
turned
off the system is hung. Even if the workqueue *does* get processed, 
there
is a race between the APU microcontrollers and driver for whether GFX
is actually powered off when RLC is turned off.

To avoid this issue, flush the workqueue on s2idle entry and ensure 
that
GFX is really in GFXOFF before any sensitive register accesses occur.

Mario Limonciello (3):
   drm/amd: Flush any delayed gfxoff on suspend entry
   drm/amd: Poll for GFX core to be off
   drm/amd: Skip RLC suspend for s0ix on PSP 13.0.4 and 13.0.11

  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 
++++++++++++++++++++++
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 18 ++++++++++++++++
  drivers/gpu/drm/amd/include/amd_shared.h   |  1 +
  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  4 ++--
  4 files changed, 46 insertions(+), 2 deletions(-)