Re: [PATCH v2 0/3] Fix DCN 3.1.4 hangs on s2idle entry

I think we replaced this with the golden timestamp value, which doesn't require GFX register access.

Ah yes, through:

5591a051b86b ("drm/amdgpu: refine get gpu clock counter method")

This wasn't part of the kernel this was originally reported on.

I suspect this would significantly reduce the likelihood of the hang. I'll confirm. I still think patches 1 and 2 make sense, though, because GFXOFF can be triggered in other ways too.

Confirmed: after adding

5591a051b86b ("drm/amdgpu: refine get gpu clock counter method")
and
ea27ee2bea6b ("drm/amdgpu/gfx11: update gpu_clock_counter logic")
the original issue goes away.

I will still refine my patches and send a v3, though, since userspace can trigger GFXOFF in other ways and we should avoid this bug.

@Alex:

Can you please queue up ea27ee2bea6b for this week's fixes and include the tags:

Cc: stable@xxxxxxxxxxxxxxx # 6.1.y: 5591a051b86b: drm/amdgpu: refine get gpu clock counter method
Cc: stable@xxxxxxxxxxxxxxx # 6.2.y: 5591a051b86b: drm/amdgpu: refine get gpu clock counter method
Cc: stable@xxxxxxxxxxxxxxx # 6.3.y: 5591a051b86b: drm/amdgpu: refine get gpu clock counter method

Here are the function calls with the patched kernel:

[   32.720456] amdgpu 0000:c2:00.0: amdgpu: set GFX off state to enabled, count:1
[   32.720457] amdgpu 0000:c2:00.0: amdgpu: broke gfx_off_mutex for gfx_v11_0_get_gpu_clock_counter+0xa8/0xf0 [amdgpu], adev->gfx.gfx_off_state is 0
[   32.760475] PM: suspend entry (s2idle)
[   32.768996] Filesystems sync: 0.008 seconds
[   32.769310] Freezing user space processes
[   32.776527] Freezing user space processes completed (elapsed 0.007 seconds)
[   32.776530] OOM killer disabled.
[   32.776531] Freezing remaining freezable tasks
[   32.777528] Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
[   32.777531] printk: Suspending console(s) (use no_console_suspend to debug)
[   32.817853] amdgpu 0000:c2:00.0: amdgpu: Delayed work to enable gfxoff
[   32.817857] amdgpu 0000:c2:00.0: amdgpu: amdgpu_dpm_set_powergating_by_smu by amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
[   32.818142] amdgpu 0000:c2:00.0: amdgpu: broke pm.mutex for amdgpu_device_delay_enable_gfx_off.cold+0x29/0x46 [amdgpu]
[   32.852099] amdgpu 0000:c2:00.0: amdgpu: smu_suspend: suspend called
[   32.852101] amdgpu 0000:c2:00.0: amdgpu: smu_disable_dpms: called

Without patch 1, the delayed work is never called on suspend entry.

Can we remove this code also as there is a flush anyway with patch 1?

Sure. Do you think it should go into patch 1 or on its own?


Preferably in patch 1 itself as it explains why it was removed.
OK.

Also, is there a need to call GFXOFF forcefully on S0ix suspend (any chance that gfxoff is not scheduled)?

If using "echo mem | sudo tee /sys/power/state" I've confirmed that it's already in GFXOFF.  I don't think this case should happen.
2) RLC is never stopped on GFX 10 or greater.

System was hanging before this series.

Patch 3 "alone" matches this behavior, as described above, by skipping RLC suspend, but two problems occur:

1) The GFXOFF workqueue doesn't get flushed, so the driver's request for GFXOFF can happen at the wrong time.

2) If suspend entry happens before GFXOFF is really asserted, there are lots of errors on resume, e.g.:


Is patch 3 really required?  Does it make any difference?

No; patch 3 isn't really required with patches 1 and 2.


My preference is to drop patch 3 and not add another place that checks in_s0ix.
OK.

Thanks,
Lijo



