[AMD Official Use Only] Of source IB test can hang the GPU. But it wait fence with one specific timeout. and it not depends on gpu scheduler. So IB test must can return. ________________________________________ 发件人: Lazar, Lijo <Lijo.Lazar@xxxxxxx> 发送时间: 2021年9月13日 15:15 收件人: Christian König; Koenig, Christian; Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx 抄送: Deucher, Alexander 主题: Re: 回复: [PATCH v2] drm/amdgpu: Fix a race of IB test On 9/13/2021 12:21 PM, Christian König wrote: > Keep in mind that we don't try to avoid contention here. The goal is > rather to have as few locks as possible to avoid the extra overhead in > the hot path. > > Contention is completely irrelevant for the debug and device reset since > that are rarely occurring events and performance doesn't matter for them. > > It is perfectly reasonable to take the write side of the reset lock as > necessary when we need to make sure that we don't have concurrent device > access. The original code has down_read which gave the impression that there is some protection to avoid access during reset. Basically would like to avoid this as a precedence for this sort of usage for any debugfs call. Reset semaphore is supposed to be a 'protect all' thing and provides a shortcut. BTW, question about a hypothetical case - what happens if the test itself causes a hang and need to trigger a reset? Will there be chance for the lock to be released (whether a submit call will hang indefinitely) for the actual reset to be executed? Thanks, Lijo > > Regards, > Christian. > > Am 13.09.21 um 08:43 schrieb Lazar, Lijo: >> There are other interfaces to emulate the exact reset process, or >> atleast this is not the one we are using for doing any sort of reset >> through debugfs. >> >> In any case, the expectation is reset thread takes the write side of >> the lock and it's already done somewhere else. >> >> Reset semaphore is supposed to protect the device from concurrent >> access (any sort of resource usage is thus protected by default). Then >> the same logic can be applied for any other call and that is not a >> reasonable ask. >> >> Thanks, >> Lijo >> >> On 9/13/2021 12:07 PM, Christian König wrote: >>> That's complete nonsense. >>> >>> The debugfs interface emulates parts of the reset procedure for >>> testing and we absolutely need to take the same locks as the reset to >>> avoid corruption of the involved objects. >>> >>> Regards, >>> Christian. >>> >>> Am 13.09.21 um 08:25 schrieb Lazar, Lijo: >>>> This is a debugfs interface and adding another writer contention in >>>> debugfs over an actual reset is lazy fix. This shouldn't be executed >>>> in the first place and should not take precedence over any reset. >>>> >>>> Thanks, >>>> Lijo >>>> >>>> >>>> On 9/13/2021 11:52 AM, Christian König wrote: >>>>> NAK, this is not the lazy way to fix it at all. >>>>> >>>>> The reset semaphore protects the scheduler and ring objects from >>>>> concurrent modification, so taking the write side of it is >>>>> perfectly valid here. >>>>> >>>>> Christian. >>>>> >>>>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui: >>>>>> [AMD Official Use Only] >>>>>> >>>>>> yep, that is a lazy way to fix it. >>>>>> >>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before >>>>>> we issue test_ib on each ring. >>>>>> ________________________________________ >>>>>> 发件人: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >>>>>> 发送时间: 2021年9月13日 12:00 >>>>>> 收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>> 抄送: Deucher, Alexander; Koenig, Christian >>>>>> 主题: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test >>>>>> >>>>>> >>>>>> >>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote: >>>>>>> Direct IB submission should be exclusive. So use write lock. >>>>>>> >>>>>>> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++-- >>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> index 19323b4cce7b..be5d12ed3db1 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> @@ -1358,7 +1358,7 @@ static int >>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >>>>>>> } >>>>>>> >>>>>>> /* Avoid accidently unparking the sched thread during GPU >>>>>>> reset */ >>>>>>> - r = down_read_killable(&adev->reset_sem); >>>>>>> + r = down_write_killable(&adev->reset_sem); >>>>>> There are many ioctls and debugfs calls which takes this lock and >>>>>> as you >>>>>> know the purpose is to avoid them while there is a reset. The >>>>>> purpose is >>>>>> *not to* fix any concurrency issues those calls themselves have >>>>>> otherwise and fixing those concurrency issues this way is just >>>>>> lazy and >>>>>> not acceptable. >>>>>> >>>>>> This will take away any fairness given to the writer in this rw >>>>>> lock and >>>>>> that is supposed to be the reset thread. >>>>>> >>>>>> Thanks, >>>>>> Lijo >>>>>> >>>>>>> if (r) >>>>>>> return r; >>>>>>> >>>>>>> @@ -1387,7 +1387,7 @@ static int >>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused) >>>>>>> kthread_unpark(ring->sched.thread); >>>>>>> } >>>>>>> >>>>>>> - up_read(&adev->reset_sem); >>>>>>> + up_write(&adev->reset_sem); >>>>>>> >>>>>>> pm_runtime_mark_last_busy(dev->dev); >>>>>>> pm_runtime_put_autosuspend(dev->dev); >>>>>>> >>>>> >>> >