[AMD Official Use Only] These IB tests are all using direct IB submission including the delayed init work. ________________________________________ 发件人: Koenig, Christian <Christian.Koenig@xxxxxxx> 发送时间: 2021年9月13日 14:19 收件人: Pan, Xinhui; Christian König; amd-gfx@xxxxxxxxxxxxxxxxxxxxx 抄送: Deucher, Alexander 主题: Re: 回复: [PATCH] drm/amdgpu: Fix a race of IB test Well is the delayed init work using direct submission or submission through the scheduler? If the later we have the down_write of the reset semaphore pulled in through the scheduler dependency. Anyway just having the sync before taking the lock should work. Christian. Am 11.09.21 um 12:18 schrieb Pan, Xinhui: > [AMD Official Use Only] > > For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing. > But I am not aware why delayed init work would try to lock reset_sem. > > delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method. > But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed. > > I will drop this flush in V2. > ________________________________________ > 发件人: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > 发送时间: 2021年9月11日 15:45 > 收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > 抄送: Deucher, Alexander; Koenig, Christian > 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test > > > > Am 11.09.21 um 03:55 schrieb xinhui pan: >> Direct IB submission should be exclusive. So use write lock. >> >> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> index 19323b4cce7b..acbe02928791 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >> } >> >> /* Avoid accidently unparking the sched thread during GPU reset */ >> - r = down_read_killable(&adev->reset_sem); >> + r = down_write_killable(&adev->reset_sem); >> if (r) >> return r; >> >> + /* Avoid concurrently IB test but not cancel it as I don't know whether we >> + * would add more code in the delayed init work. >> + */ >> + flush_delayed_work(&adev->delayed_init_work); >> + > That won't work. It's at least theoretical possible that the delayed > init work waits for the reset_sem which we are holding here. > > Very unlikely to happen, but lockdep might be able to point that out > with a nice backtrace in the logs. > > On the other hand delayed init work and direct IB test through this > interface should work at the same time, so I would just drop it. > > Christian. > >> /* hold on the scheduler */ >> for (i = 0; i < AMDGPU_MAX_RINGS; i++) { >> struct amdgpu_ring *ring = adev->rings[i]; >> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >> kthread_unpark(ring->sched.thread); >> } >> >> - up_read(&adev->reset_sem); >> + up_write(&adev->reset_sem); >> >> pm_runtime_mark_last_busy(dev->dev); >> pm_runtime_put_autosuspend(dev->dev);