Thinking more about this claim - we assume here that if cancel_delayed_work returned true it guarantees that the timeout work is not running. But it merely means there was a pending timeout work item which was removed from the workqueue before its timer elapsed, so it never had a chance to be dequeued and executed; it does not cover work that is already executing. So there is a possibility that while one timeout work item has started executing, another timeout work item gets enqueued (maybe through earlier cleanup jobs or through drm_sched_fault), and if at this point another drm_sched_cleanup_jobs runs, cancel_delayed_work(&sched->work_tdr) will return true even while a timeout job is in progress. Unfortunately we cannot change cancel_delayed_work to cancel_delayed_work_sync to flush the timeout work, as the timeout work itself waits for the scheduler thread to be parked again when calling park_thread.

Andrey

________________________________________
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: 08 November 2019 05:35:18
To: Deng, Emily; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Hi Emily,

exactly that can't happen. See here:

>         /* Don't destroy jobs while the timeout worker is running */
>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>             !cancel_delayed_work(&sched->work_tdr))
>                 return NULL;

We never free jobs while the timeout worker is running, to prevent exactly that issue.

Regards,
Christian.

On 08.11.19 at 11:32, Deng, Emily wrote:
> Hi Christian,
>      The drm_sched_job_timedout -> amdgpu_job_timedout path calls amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs while we are in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>
> Best wishes
> Emily Deng
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>> Sent: Friday, November 8, 2019 6:26 PM
>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Emily,
>>
>> well who is calling amdgpu_device_gpu_recover() in this case?
>>
>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>
>> Regards,
>> Christian.
>>
>> On 08.11.19 at 11:22, Deng, Emily wrote:
>>> Hi Christian,
>>>      No, I am on the new branch and it also has the patch. Even if they are freed by the main scheduler, how can we avoid the main scheduler freeing jobs while we enter amdgpu_device_gpu_recover?
>>>
>>> Best wishes
>>> Emily Deng
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> Hi Emily,
>>>>
>>>> in this case you are on an old code branch.
>>>>
>>>> Jobs are freed now by the main scheduler thread and only if no timeout handler is running.
>>>>
>>>> See this patch here:
>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>> Author: Christian König <christian.koenig@xxxxxxx>
>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>
>>>>>     drm/scheduler: rework job destruction
>>>>
>>>> Regards,
>>>> Christian.
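To make the window described at the top concrete, here is a rough interleaving sketch (not tested; only the guard itself is the code quoted above from the cleanup path, the thread labels and the re-arming example are illustrative):

	/* Thread A: drm_sched_job_timedout() is already executing for an earlier TDR. */

	/*
	 * Thread B: while A is still running, something re-arms the timer,
	 * e.g. drm_sched_fault() or the timeout of a later job re-queues
	 * sched->work_tdr.
	 */

	/* Thread C: the cleanup path then evaluates the guard: */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !cancel_delayed_work(&sched->work_tdr))
		return NULL;

	/*
	 * cancel_delayed_work() removes the pending work queued by B and
	 * returns true, but it does not wait for the handler started by A.
	 * So C carries on freeing jobs while A may still dereference them.
	 * cancel_delayed_work_sync() would wait for A to finish, but as noted
	 * above it cannot be used here, because the timeout handler itself
	 * waits for the scheduler thread to be parked, which would deadlock.
	 */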
>>>>
>>>> On 08.11.19 at 11:11, Deng, Emily wrote:
>>>>> Hi Christian,
>>>>>      Please refer to the following log. When it enters the amdgpu_device_gpu_recover function, the bad job 000000005086879e is being freed in amdgpu_job_free_cb at the same time, because of the hardware fence signal. But amdgpu_device_gpu_recover goes faster; in this case the s_fence is already freed, but the job is not freed in time. Then this issue occurs.
>>>>>
>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>> [ 449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>> [ 449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>> [ 449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>> [ 449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>> [ 449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>> [ 449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>> [ 449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>> [ 449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>> [ 449.800818] PGD 0 P4D 0
>>>>> [ 449.801040] Oops: 0000 [#1] SMP PTI
>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>> [ 449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>> [ 449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>> [ 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>> [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>> [ 449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>> [ 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>> [ 449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>> [ 449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>> [ 449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>> [ 449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> [ 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> [ 449.811937] Call Trace:
>>>>> [ 449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>> [ 449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>> [ 449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>> [ 449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>> [ 449.814077]  process_one_work+0x1fd/0x3f0
>>>>> [ 449.814417]  worker_thread+0x34/0x410
>>>>> [ 449.814728]  kthread+0x121/0x140
>>>>> [ 449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>> [ 449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>> [ 449.815799]  ret_from_fork+0x35/0x40
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>>>>>> Sorry, please take your time.
>>>>>> Have you seen my other response a bit below?
>>>>>>
>>>>>> I can't follow how it would be possible for job->s_fence to be NULL without the job also being freed.
>>>>>>
>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>
>>>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>>>>>> Ping.....
>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Emily Deng
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Deng, Emily
>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>
>>>>>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>>>>>> When the job is already signaled, the s_fence is freed. Then it will hit a null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one case, when it enters amdgpu_device_gpu_recover, it is already in drm_sched_job_cleanup, and at this time it will go on to free the job. But amdgpu_device_gpu_recover sometimes is faster. At that time, the job is not freed, but s_fence is already NULL.
>>>>>>>> No, that case can't happen. See here:
>>>>>>>>
>>>>>>>>>         drm_sched_job_cleanup(s_job);
>>>>>>>>>
>>>>>>>>>         amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>         dma_fence_put(job->fence);
>>>>>>>>>         amdgpu_sync_free(&job->sync);
>>>>>>>>>         amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>         kfree(job);
>>>>>>>> The job itself is freed up directly after freeing the reference to the s_fence.
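For reference, here is a condensed sketch of the two functions being discussed (paraphrased from the snippet quoted above and from the scheduler's drm_sched_job_cleanup(); not the verbatim upstream source), showing why observing s_fence == NULL means the job memory itself is about to disappear as well:

	/* drivers/gpu/drm/scheduler/sched_main.c (condensed) */
	void drm_sched_job_cleanup(struct drm_sched_job *job)
	{
		dma_fence_put(&job->s_fence->finished);
		job->s_fence = NULL;            /* the s_fence reference is dropped here */
	}

	/* drivers/gpu/drm/amd/amdgpu/amdgpu_job.c (condensed) */
	static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
	{
		struct amdgpu_job *job = to_amdgpu_job(s_job);

		drm_sched_job_cleanup(s_job);   /* s_job->s_fence becomes NULL ... */

		dma_fence_put(job->fence);
		amdgpu_sync_free(&job->sync);
		amdgpu_sync_free(&job->sched_sync);
		kfree(job);                     /* ... and the job itself is freed right after */
	}

	/*
	 * Any code that can observe s_fence == NULL is therefore also racing
	 * with kfree(job); a NULL check in amdgpu_device_gpu_recover() only
	 * narrows that window, it does not close it.
	 */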
>>>>>>>> So you are just papering over a much bigger problem here. This patch is a clear NAK.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>>> When you see a job without an s_fence then that means the problem is somewhere else.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>>>> ---
>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>   	 *
>>>>>>>>>>>>   	 * job->base holds a reference to parent fence
>>>>>>>>>>>>   	 */
>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>   		job_signaled = true;
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>
>>>>>>>>>>>>   			spin_lock(&rq->lock);
>>>>>>>>>>>>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>   					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>   					    bad->sched->hang_limit)
>>>>>>>>>>>>   						if (entity->guilty)
>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>   	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>   	 */
>>>>>>>>>>>>   	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>   					      &s_job->cb)) {
>>>>>>>>>>>>   			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>   			 *
>>>>>>>>>>>>   			 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>   			 */
>>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>   			/*
>>>>>>>>>>>>   			 * We must keep bad job alive for later use during
>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>   	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>   	 */
>>>>>>>>>>>>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>
>>>>>>>>>>>>   		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx