On 8/12/2021 10:24 PM, Michel Dänzer wrote:
On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
On 8/12/2021 1:41 PM, Michel Dänzer wrote:
On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
Hi James,
Evan seems to have understood how this all works together.
See, while any begin/end use critical section is active the work should not be active.
When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items from being started.
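Just to illustrate the counter idea, a rough sketch (my_dev, work_mutex, active_count, idle_work and IDLE_DELAY are made-up names, not the actual ring code):

void ring_begin_use(struct my_dev *dev)
{
        mutex_lock(&dev->work_mutex);
        /* First active user cancels any pending idle work. */
        if (dev->active_count++ == 0)
                cancel_delayed_work(&dev->idle_work);
        mutex_unlock(&dev->work_mutex);
}

void ring_end_use(struct my_dev *dev)
{
        mutex_lock(&dev->work_mutex);
        /* Last active user re-arms the idle work. */
        if (--dev->active_count == 0)
                schedule_delayed_work(&dev->idle_work, IDLE_DELAY);
        mutex_unlock(&dev->work_mutex);
}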
Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
It merely assumes that the work may already have been scheduled before.
Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
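To spell out the API semantics I'm relying on here (simplified illustration, not a patch):

/* If the work is not queued yet, these two behave the same: */
schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);

/* mod_delayed_work() additionally re-arms the timer if the work was
 * already queued, instead of leaving the earlier expiry in place. */
mod_delayed_work(system_wq, &adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);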
Something similar applies to the first patch, I think.
There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
Curious -
For patch 1, does it make a difference if any scheduled delayed work is cancelled in the else part before proceeding?
} else if (!enable && adev->gfx.gfx_off_state) {
        cancel_delayed_work(&adev->gfx.gfx_off_delay_work);
I tried the patch below.
While this does seem to fix the problem as well, I see a potential issue:
1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
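In other words, the interleaving I'm worried about would roughly be (pseudo-trace, not actual code):

/* Thread A: amdgpu_gfx_off_ctrl(adev, false) */
mutex_lock(&adev->gfx.gfx_off_mutex);           /* A now holds the mutex */

/* Delayed work: amdgpu_device_delay_enable_gfx_off() */
mutex_lock(&adev->gfx.gfx_off_mutex);           /* blocks, waiting for A */

/* Thread A, still holding the mutex: */
cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
/* waits for the work item to finish -> neither side can make progress */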
Should use cancel_delayed_work instead of the _sync version. As you
mentioned, at best the work is not scheduled yet and is cancelled
successfully, or at worst it's waiting for the mutex. In the worst case,
if amdgpu_device_delay_enable_gfx_off gets the mutex after
amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.
if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)
The count wouldn't be 0 and hence it won't enable GFXOFF.
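For reference, the work handler looks roughly like this (simplified sketch from memory, not a verbatim copy of amdgpu_device_delay_enable_gfx_off):

static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);

        mutex_lock(&adev->gfx.gfx_off_mutex);
        /* A non-zero req_count means GFXOFF was disabled again in the
         * meantime, so leave the hardware alone. */
        if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
                if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
                        adev->gfx.gfx_off_state = true;
        }
        mutex_unlock(&adev->gfx.gfx_off_mutex);
}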
Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)
As mentioned earlier, cancel_delayed_work won't cause this issue.
In the mod_delayed_ patch, the mod_ version is called only when req_count is
0. While that is a good thing, it keeps alive one more contender for the
mutex.
The cancel_ version eliminates that contender if it happens to be called at
the right time (more likely if there are multiple requests to disable
gfxoff). On the other hand, I don't know how costly it is to call cancel_
every time in the else part (or maybe call it only once, when the count
increments to 1? See the sketch below).
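Roughly what I mean by the latter, just as a sketch (simplified, and assuming req_count has already been incremented for this disable request):

} else if (!enable) {
        /* Cancel only on the first disable request, i.e. when req_count
         * has just gone from 0 to 1. */
        if (adev->gfx.gfx_off_req_count == 1)
                cancel_delayed_work(&adev->gfx.gfx_off_delay_work);

        if (adev->gfx.gfx_off_state &&
            !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
                adev->gfx.gfx_off_state = false;
}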
Thanks,
Lijo
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index a0be0772c8b3..3e4585ffb9af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -570,8 +570,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
- } else if (!enable && adev->gfx.gfx_off_state) {
- if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
+ } else if (!enable) {
+ cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
+
+ if (adev->gfx.gfx_off_state &&
+ !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
adev->gfx.gfx_off_state = false;
if (adev->gfx.funcs->init_spm_golden) {