On 8/13/2021 4:01 PM, Michel Dänzer wrote:
On 2021-08-13 6:23 a.m., Lazar, Lijo wrote:
On 8/12/2021 10:24 PM, Michel Dänzer wrote:
On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
On 8/12/2021 1:41 PM, Michel Dänzer wrote:
On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
Hi James,
Evan seems to have understood how this all works together.
See, while any begin/end use critical section is active, the work should not be active.
When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items from being started.
Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
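The counter approach described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct idle_model, model_begin_use and model_end_use are hypothetical names, a plain bool stands in for the delayed work item, and the locking or atomics a real driver would need are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model, not the amdgpu code. */
struct idle_model {
    int  active_users;  /* rings currently between begin_use and end_use */
    bool work_pending;  /* stands in for a queued delayed work item */
};

/* Any ring entering its critical section cancels the idle work. */
static void model_begin_use(struct idle_model *m)
{
    m->active_users++;
    m->work_pending = false;  /* models cancelling the delayed work */
}

/* Only the last ring to leave is allowed to queue the idle work again. */
static void model_end_use(struct idle_model *m)
{
    assert(m->active_users > 0);
    if (--m->active_users == 0)
        m->work_pending = true;  /* models scheduling the delayed work */
}
```

With two rings active, the first model_end_use call leaves the work unqueued; only the second one queues it.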
It merely assumes that the work may already have been scheduled before.
Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
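The difference between the two calls can be modelled in a few lines. This is a simplified userspace sketch, assuming a work item is fully described by a pending flag and an expiry time; struct dwork_model, model_schedule and model_mod are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a delayed work item. */
struct dwork_model {
    bool pending;  /* timer armed, handler not yet started */
    long expires;  /* time at which the handler would run */
};

/* Models schedule_delayed_work(): no effect if already pending. */
static bool model_schedule(struct dwork_model *w, long now, long delay)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/* Models mod_delayed_work(): always (re)arms the timer. */
static bool model_mod(struct dwork_model *w, long now, long delay)
{
    bool was_pending = w->pending;

    w->pending = true;
    w->expires = now + delay;
    return was_pending;
}
```

On an idle work item both calls arm the timer identically. They differ only when the item is already pending: model_schedule leaves the expiry alone, while model_mod pushes it out.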
Something similar applies to the first patch I think,
There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
Curious -
For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
} else if (!enable && adev->gfx.gfx_off_state) {
        cancel_delayed_work();
I tried the patch below.
While this does seem to fix the problem as well, I see a potential issue:
1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
Should use the cancel_delayed_work instead of the _sync version.
The thing is, it's not clear to me from cancel_delayed_work's description that it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if it's already running. If that's not guaranteed, it's prone to the same deadlock.
From what I understood from the description, cancel only initiates a
cancel. If the work has already started, it returns false, saying it
couldn't succeed; otherwise it cancels the scheduled work and returns
true. The note below specifically asks to use the _sync version
if we need to wait for already started work, and that definitely has
the deadlock problem you mentioned above.
* Note:
* The work callback function may still be running on return, unless
* it returns %true and the work doesn't re-arm itself. Explicitly flush or
* use cancel_delayed_work_sync() to wait on it.
As you mentioned - at best work is not scheduled yet and cancelled successfully, or at worst it's waiting for the mutex. In the worst case, if amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.
if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)
The count wouldn't be 0 and hence it won't enable GFXOFF.
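The effect of that check can be illustrated with a small userspace model. It is a sketch only: the field names follow the condition quoted above, while struct gfx_model and model_delayed_enable are hypothetical, and the mutex and the actual hardware call are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the check looks at. */
struct gfx_model {
    bool gfx_off_state;      /* true once GFXOFF has been enabled */
    int  gfx_off_req_count;  /* outstanding requests to keep GFXOFF disabled */
};

/*
 * Models the guarded body of the delayed work handler.
 * Returns true if this call enabled GFXOFF.
 */
static bool model_delayed_enable(struct gfx_model *g)
{
    if (!g->gfx_off_state && !g->gfx_off_req_count) {
        g->gfx_off_state = true;  /* stands in for the hardware enable */
        return true;
    }
    return false;
}
```

If the handler only gets to run after a disable request has raised the count, the check fails and GFXOFF stays disabled.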
I'm not sure, but it might also be possible for amdgpu_device_delay_enable_gfx_off to get the mutex only after amdgpu_gfx_off_ctrl was called again and set adev->gfx.gfx_off_req_count back to 0.
Yes, this is a case we can't avoid either way. If the work has
already started, then mod_delayed_ doesn't have any impact either. Another
case is when the work thread has already got the mutex and a disable
request comes in just at that time. It needs to wait until the mutex is
released by the work, which could mean gfxoff is enabled and then
immediately disabled.
Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)
As mentioned earlier, cancel_delayed_work won't cause this issue.
In the mod_delayed_ patch, the mod_ version is called only when req_count is 0. While that is a good thing, it keeps alive one more contender for the mutex.
Not sure what you mean. It leaves the possibility of amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried to postpone it. As discussed above, something similar might be possible with cancel_delayed_work as well.
The mod_delayed is called only when req_count gets back to 0. If
another disable request comes after that, it neither cancels the
scheduled work nor adjusts the delay.
Ex:
Disable gfxoff -> Enable gfxoff (now the work is scheduled) -> Disable
gfxoff (within 5ms or whatever the delay is, but this call won't go to
the mod_delayed path to delay it further) -> Work starts after 5ms and
creates contention for the mutex -> Enable gfxoff
When cancel_ is used, the second disable call immediately cancels out
any work that is scheduled but not started and it doesn't create an
unnecessary contention for the mutex. It's a matter of who gets the
mutex first. Cancel has a better chance to eliminate the second thread
possibility.
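The example sequence above can be replayed in a single-threaded simulation that counts how often the handler ends up running. This is a hedged sketch, not the driver logic: struct sim, the two policies, the sim_* helpers and the 5 ms SIM_DELAY_MS are all assumptions made for illustration, and real mutex contention is only represented by the handler_runs counter.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_DELAY_MS 5  /* placeholder delay, assumption */

enum policy { POLICY_MOD_ON_ENABLE, POLICY_CANCEL_ON_DISABLE };

struct sim {
    enum policy policy;
    int  req_count;     /* outstanding disable requests */
    bool pending;       /* delayed work queued but not started */
    int  deadline_ms;   /* when the queued work would start */
    int  handler_runs;  /* times the handler ran and took the mutex */
    bool gfx_off;       /* modelled GFXOFF state */
};

/* Run the handler if its timer has expired by now_ms. */
static void sim_advance(struct sim *s, int now_ms)
{
    if (s->pending && now_ms >= s->deadline_ms) {
        s->pending = false;
        s->handler_runs++;
        if (!s->gfx_off && s->req_count == 0)
            s->gfx_off = true;
    }
}

static void sim_disable(struct sim *s, int now_ms)
{
    sim_advance(s, now_ms);
    s->req_count++;
    if (s->policy == POLICY_CANCEL_ON_DISABLE)
        s->pending = false;  /* models cancelling not-yet-started work */
    s->gfx_off = false;
}

static void sim_enable(struct sim *s, int now_ms)
{
    sim_advance(s, now_ms);
    if (s->req_count > 0)
        s->req_count--;
    if (s->req_count == 0 && !s->gfx_off) {
        /* (Re)arm the timer; for simplicity both policies re-arm here. */
        s->pending = true;
        s->deadline_ms = now_ms + SIM_DELAY_MS;
    }
}

/* Replays: disable, enable, disable, timer expiry, enable. */
static int sim_run_example(enum policy p)
{
    struct sim s = { .policy = p };

    sim_disable(&s, 0);
    sim_enable(&s, 1);   /* work queued, due at 6 ms */
    sim_disable(&s, 2);  /* arrives before the work starts */
    sim_advance(&s, 6);  /* timer expiry */
    sim_enable(&s, 7);
    return s.handler_runs;
}
```

In this model the handler runs once under the mod-on-enable policy and not at all under the cancel-on-disable policy, which is the extra mutex contender described above.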
The cancel_ version eliminates that contender if it happens to be called at the right time (more likely if there are multiple requests to disable gfxoff). On the other hand, I don't know how costly it is to call cancel_ every time in the else part (or maybe call it only once, when the count increments to 1?).
Sure, why not, though I doubt it matters much — I expect adev->gfx.gfx_off_req_count transitioning between 0 <-> 1 to be the most common case by far.
I sent out a v2 patch which should address all these issues.
Will check that.
Thanks,
Lijo