Thanks for the clarification Xinhui.
Based on Christian's explanation, what I understood is - this is an
exceptional case in debugfs calls and the other goal is to avoid
maintenance of one more lock just to support this API. I no longer have
any issues with this approach.
Thanks,
Lijo
On 9/13/2021 12:53 PM, Pan, Xinhui wrote:
[AMD Official Use Only]
Of source IB test can hang the GPU.
But it wait fence with one specific timeout. and it not depends on gpu scheduler.
So IB test must can return.
________________________________________
发件人: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
发送时间: 2021年9月13日 15:15
收件人: Christian König; Koenig, Christian; Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
抄送: Deucher, Alexander
主题: Re: 回复: [PATCH v2] drm/amdgpu: Fix a race of IB test
On 9/13/2021 12:21 PM, Christian König wrote:
Keep in mind that we don't try to avoid contention here. The goal is
rather to have as few locks as possible to avoid the extra overhead in
the hot path.
Contention is completely irrelevant for the debug and device reset since
that are rarely occurring events and performance doesn't matter for them.
It is perfectly reasonable to take the write side of the reset lock as
necessary when we need to make sure that we don't have concurrent device
access.
The original code has down_read which gave the impression that there is
some protection to avoid access during reset. Basically would like to
avoid this as a precedence for this sort of usage for any debugfs call.
Reset semaphore is supposed to be a 'protect all' thing and provides a
shortcut.
BTW, question about a hypothetical case - what happens if the test
itself causes a hang and need to trigger a reset? Will there be chance
for the lock to be released (whether a submit call will hang
indefinitely) for the actual reset to be executed?
Thanks,
Lijo
Regards,
Christian.
Am 13.09.21 um 08:43 schrieb Lazar, Lijo:
There are other interfaces to emulate the exact reset process, or
atleast this is not the one we are using for doing any sort of reset
through debugfs.
In any case, the expectation is reset thread takes the write side of
the lock and it's already done somewhere else.
Reset semaphore is supposed to protect the device from concurrent
access (any sort of resource usage is thus protected by default). Then
the same logic can be applied for any other call and that is not a
reasonable ask.
Thanks,
Lijo
On 9/13/2021 12:07 PM, Christian König wrote:
That's complete nonsense.
The debugfs interface emulates parts of the reset procedure for
testing and we absolutely need to take the same locks as the reset to
avoid corruption of the involved objects.
Regards,
Christian.
Am 13.09.21 um 08:25 schrieb Lazar, Lijo:
This is a debugfs interface and adding another writer contention in
debugfs over an actual reset is lazy fix. This shouldn't be executed
in the first place and should not take precedence over any reset.
Thanks,
Lijo
On 9/13/2021 11:52 AM, Christian König wrote:
NAK, this is not the lazy way to fix it at all.
The reset semaphore protects the scheduler and ring objects from
concurrent modification, so taking the write side of it is
perfectly valid here.
Christian.
Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
[AMD Official Use Only]
yep, that is a lazy way to fix it.
I am thinking of adding one amdgpu_ring.direct_access_mutex before
we issue test_ib on each ring.
________________________________________
发件人: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
发送时间: 2021年9月13日 12:00
收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
抄送: Deucher, Alexander; Koenig, Christian
主题: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
On 9/13/2021 5:18 AM, xinhui pan wrote:
Direct IB submission should be exclusive. So use write lock.
Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 19323b4cce7b..be5d12ed3db1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1358,7 +1358,7 @@ static int
amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
}
/* Avoid accidently unparking the sched thread during GPU
reset */
- r = down_read_killable(&adev->reset_sem);
+ r = down_write_killable(&adev->reset_sem);
There are many ioctls and debugfs calls which takes this lock and
as you
know the purpose is to avoid them while there is a reset. The
purpose is
*not to* fix any concurrency issues those calls themselves have
otherwise and fixing those concurrency issues this way is just
lazy and
not acceptable.
This will take away any fairness given to the writer in this rw
lock and
that is supposed to be the reset thread.
Thanks,
Lijo
if (r)
return r;
@@ -1387,7 +1387,7 @@ static int
amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
kthread_unpark(ring->sched.thread);
}
- up_read(&adev->reset_sem);
+ up_write(&adev->reset_sem);
pm_runtime_mark_last_busy(dev->dev);
pm_runtime_put_autosuspend(dev->dev);