[AMD Official Use Only] For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing. But I am not aware why delayed init work would try to lock reset_sem. delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method. But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed. I will drop this flush in V2. ________________________________________ 发件人: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> 发送时间: 2021年9月11日 15:45 收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx 抄送: Deucher, Alexander; Koenig, Christian 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test Am 11.09.21 um 03:55 schrieb xinhui pan: > Direct IB submission should be exclusive. So use write lock. > > Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > index 19323b4cce7b..acbe02928791 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) > } > > /* Avoid accidently unparking the sched thread during GPU reset */ > - r = down_read_killable(&adev->reset_sem); > + r = down_write_killable(&adev->reset_sem); > if (r) > return r; > > + /* Avoid concurrently IB test but not cancel it as I don't know whether we > + * would add more code in the delayed init work. > + */ > + flush_delayed_work(&adev->delayed_init_work); > + That won't work. It's at least theoretical possible that the delayed init work waits for the reset_sem which we are holding here. Very unlikely to happen, but lockdep might be able to point that out with a nice backtrace in the logs. On the other hand delayed init work and direct IB test through this interface should work at the same time, so I would just drop it. Christian. > /* hold on the scheduler */ > for (i = 0; i < AMDGPU_MAX_RINGS; i++) { > struct amdgpu_ring *ring = adev->rings[i]; > @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) > kthread_unpark(ring->sched.thread); > } > > - up_read(&adev->reset_sem); > + up_write(&adev->reset_sem); > > pm_runtime_mark_last_busy(dev->dev); > pm_runtime_put_autosuspend(dev->dev);