Hi Andrey,
     I don't think your patch will help here. It may call kthread_should_park() in drm_sched_cleanup_jobs first and only call kcl_kthread_park() afterwards, so there is still a race between the two threads (a toy model of this interleaving is sketched at the end of this mail).

Best wishes
Emily Deng

>-----Original Message-----
>From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>Sent: Saturday, November 9, 2019 3:01 AM
>To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Deng, Emily
><Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>
>On 11/8/19 5:35 AM, Koenig, Christian wrote:
>> Hi Emily,
>>
>> exactly that can't happen. See here:
>>
>>>         /* Don't destroy jobs while the timeout worker is running */
>>>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>             !cancel_delayed_work(&sched->work_tdr))
>>>                 return NULL;
>> We never free jobs while the timeout worker is running, to prevent
>> exactly that issue.
>
>
>I don't think this protects us if drm_sched_cleanup_jobs is called for a
>scheduler which didn't experience a timeout: in amdgpu_device_gpu_recover
>we access sched->ring_mirror_list for all the schedulers on a device, so the
>condition above won't protect us. What in fact could help is my recent patch
>541c521 ("drm/sched: Avoid job cleanup if sched thread is parked"), because
>we do park each of the scheduler threads during tdr before trying to access
>sched->ring_mirror_list.
>
>Emily - did you see this problem with that patch in place? I only pushed it
>yesterday.
>
>Andrey
>
>
>>
>> Regards,
>> Christian.
>>
>> On 08.11.19 11:32, Deng, Emily wrote:
>>> Hi Christian,
>>>      drm_sched_job_timedout -> amdgpu_job_timedout calls
>>> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs
>>> while we are in amdgpu_device_gpu_recover, before drm_sched_stop is
>>> called.
>>>
>>> Best wishes
>>> Emily Deng
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> Hi Emily,
>>>>
>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>
>>>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 08.11.19 11:22, Deng, Emily wrote:
>>>>> Hi Christian,
>>>>>      No, I am on the new branch and it also has the patch. Even if
>>>>> jobs are freed by the main scheduler, how can we avoid the main
>>>>> scheduler freeing jobs while we enter amdgpu_device_gpu_recover?
>>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> Hi Emily,
>>>>>>
>>>>>> in this case you are on an old code branch.
>>>>>>
>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>> timeout handler is running.
>>>>>>
>>>>>> See this patch here:
>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>> Author: Christian König <christian.koenig@xxxxxxx>
>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>
>>>>>>>     drm/scheduler: rework job destruction
>>>>>> Regards,
>>>>>> Christian.
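A note on the scope of the guard quoted at the top of Andrey's mail: it only serializes job cleanup against that one scheduler's own timeout worker, which is exactly the gap Andrey points out. Below is a compilable paraphrase of that check with all kernel types and helpers stubbed out; only the control flow is taken from the snippet above, and first_finished_job() is an invented placeholder for the rest of the cleanup function.

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SCHEDULE_TIMEOUT LONG_MAX

struct delayed_work { int pending; };          /* stub */
struct drm_sched_job;                          /* opaque here */
struct drm_gpu_scheduler {
        long timeout;
        struct delayed_work work_tdr;          /* runs the timeout handler */
};

/* stub: true if the timeout worker was idle and has now been cancelled */
static bool cancel_delayed_work(struct delayed_work *work)
{
        (void)work;
        return true;
}

/* invented placeholder for "pop the next finished job off ring_mirror_list" */
static struct drm_sched_job *first_finished_job(struct drm_gpu_scheduler *sched)
{
        (void)sched;
        return NULL;
}

static struct drm_sched_job *
cleanup_guard_sketch(struct drm_gpu_scheduler *sched)
{
        /* Don't destroy jobs while *this* scheduler's timeout worker runs. */
        if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
            !cancel_delayed_work(&sched->work_tdr))
                return NULL;

        /*
         * Note the scope: only this scheduler's work_tdr is checked. A
         * timeout handler running for ring A that walks ring B's
         * ring_mirror_list is not held off by ring B's guard.
         */
        return first_finished_job(sched);
}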
>>>>>>
>>>>>> On 08.11.19 11:11, Deng, Emily wrote:
>>>>>>> Hi Christian,
>>>>>>>      Please refer to the following log. When it enters the
>>>>>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is
>>>>>>> being freed in amdgpu_job_free_cb at the same time, because the
>>>>>>> hardware fence signaled. But amdgpu_device_gpu_recover goes faster;
>>>>>>> in this case the s_fence is already freed, but the job is not freed
>>>>>>> in time, and then this issue occurs.
>>>>>>>
>>>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>> [  449.800818] PGD 0 P4D 0
>>>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE     4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>> [  449.811937] Call Trace:
>>>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>> [  449.814077]  process_one_work+0x1fd/0x3f0
>>>>>>> [  449.814417]  worker_thread+0x34/0x410
>>>>>>> [  449.814728]  kthread+0x121/0x140
>>>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>> [  449.815799]  ret_from_fork+0x35/0x40
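The oops above is the observable half of the race Emily describes. As an illustration only, here is a small user-space model of that interleaving: one thread plays the free callback (s_fence torn down first, the job freed a moment later), the other plays amdgpu_device_gpu_recover() dereferencing job->s_fence without a NULL check. The structures, names, and timings here are invented for the model; only the two-step teardown and the unchecked dereference come from the discussion in this thread. Built with cc -pthread race.c, the second thread tends to fault inside the window, analogous to the NULL dereference in the log above.

/* Toy user-space model of the race described above -- not kernel code.
 * Thread A models amdgpu_job_free_cb() (s_fence cleared, job freed later);
 * thread B models amdgpu_device_gpu_recover() touching job->s_fence.
 * Crashing here is the "success" case: it demonstrates the window.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct s_fence { void *parent; };
struct job     { struct s_fence *s_fence; };

static struct job *bad_job;

static void *free_cb(void *arg)          /* models amdgpu_job_free_cb */
{
        (void)arg;
        free(bad_job->s_fence);
        bad_job->s_fence = NULL;         /* models drm_sched_job_cleanup() */
        usleep(1000);                    /* window: job alive, s_fence gone */
        free(bad_job);                   /* models kfree(job) */
        return NULL;
}

static void *gpu_recover(void *arg)      /* models amdgpu_device_gpu_recover */
{
        (void)arg;
        /* unchecked dereference, as in the unpatched code */
        printf("parent = %p\n", bad_job->s_fence->parent);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        bad_job = calloc(1, sizeof(*bad_job));
        bad_job->s_fence = calloc(1, sizeof(*bad_job->s_fence));
        pthread_create(&a, NULL, free_cb, NULL);
        pthread_create(&b, NULL, gpu_recover, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}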
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>
>>>>>>>> On 08.11.19 10:39, Deng, Emily wrote:
>>>>>>>>> Sorry, please take your time.
>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>
>>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>> NULL without the job also being freed.
>>>>>>>>
>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Emily Deng
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>
>>>>>>>>>> On 08.11.19 09:52, Deng, Emily wrote:
>>>>>>>>>>> Ping.....
>>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07.11.19 11:25, Emily Deng wrote:
>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed. Then
>>>>>>>>>>>>>> there will be a null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>>>>>> case, when it enters amdgpu_device_gpu_recover, it is already in
>>>>>>>>>>>> drm_sched_job_cleanup, and at this time it will go on to free
>>>>>>>>>>>> the job. But amdgpu_device_gpu_recover is sometimes faster. At
>>>>>>>>>>>> that time the job is not freed, but s_fence is already NULL.
>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>          drm_sched_job_cleanup(s_job);
>>>>>>>>>>>
>>>>>>>>>>>          amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>          dma_fence_put(job->fence);
>>>>>>>>>>>          amdgpu_sync_free(&job->sync);
>>>>>>>>>>>          amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>          kfree(job);
>>>>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>>>>>> to the s_fence. So you are just papering over a much bigger
>>>>>>>>>> problem here. This patch is a clear NAK.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
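Both sides above are arguing about the same few lines: drm_sched_job_cleanup() clears s_fence before kfree(job) runs, so there is a short span in which the job allocation is still valid while s_fence is already NULL. A compilable, annotated sketch of the ordering in the snippet Christian quotes, with the kernel types and helpers stubbed out; only the sequence is the point.

/* Annotated sketch of the free path quoted above; stubs, not kernel code. */
#include <stdlib.h>

struct drm_sched_fence { int dummy; };
struct drm_sched_job   { struct drm_sched_fence *s_fence; };
struct amdgpu_job      { struct drm_sched_job base; void *fence; };

static void drm_sched_job_cleanup(struct drm_sched_job *job)
{
        job->s_fence = NULL;    /* the real function also drops the ref */
}

static void dma_fence_put(void *fence) { (void)fence; }

static void amdgpu_job_free_cb_sketch(struct amdgpu_job *job)
{
        drm_sched_job_cleanup(&job->base);
        /*
         * The span Emily points at: the job is still allocated here but
         * s_fence is already NULL. Christian's counter-argument is that
         * kfree(job) follows immediately, so nothing should be able to
         * observe the job in this state.
         */
        dma_fence_put(job->fence);
        free(job);              /* stands in for kfree(job) */
}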
>>>>>>>>>>
>>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>   	 *
>>>>>>>>>>>>>>   	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>>   	 */
>>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>   		job_signaled = true;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   			spin_lock(&rq->lock);
>>>>>>>>>>>>>>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>>   					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>   					    bad->sched->hang_limit)
>>>>>>>>>>>>>>   						if (entity->guilty)
>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>   	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>   	 */
>>>>>>>>>>>>>>   	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>   					      &s_job->cb)) {
>>>>>>>>>>>>>>   			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>   		 *
>>>>>>>>>>>>>>   		 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>   		 */
>>>>>>>>>>>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>> +		if (s_job->s_fence)
>>>>>>>>>>>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   		/*
>>>>>>>>>>>>>>   		 * We must keep bad job alive for later use during
>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>>   	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>   	 */
>>>>>>>>>>>>>>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
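As referenced at the top of this mail, here is a toy user-space model of the check-then-park window Emily describes: thread A plays drm_sched_cleanup_jobs sampling kthread_should_park() before freeing, and thread B plays the recovery path that parks the scheduler thread only after A has already sampled. All names, flags, and sleeps are invented for the model; only the ordering is the point. Whether the real kthread_park() path actually leaves this window open is exactly what is being debated above; the model only shows why a check-then-act sequence is not sufficient on its own.

/* Toy model of the check-then-park race -- not kernel code.
 * Build: cc -pthread -o park park.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool parked;     /* models kthread_park() having taken effect */
static atomic_int  job_freed;  /* models the job leaving ring_mirror_list   */

static void *cleanup_thread(void *arg)  /* models drm_sched_cleanup_jobs */
{
        (void)arg;
        if (!atomic_load(&parked)) {    /* kthread_should_park() sampled first */
                usleep(1000);           /* recovery parks us in this gap...    */
                atomic_store(&job_freed, 1); /* ...but we free the job anyway  */
        }
        return NULL;
}

static void *recovery_thread(void *arg) /* models the tdr/recovery path */
{
        (void)arg;
        usleep(500);
        atomic_store(&parked, 1);       /* models kcl_kthread_park() */
        usleep(1000);                   /* then start walking ring_mirror_list */
        /* despite parking first, the cleanup pass already in flight ran: */
        printf("job freed while recovering: %d\n", atomic_load(&job_freed));
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, cleanup_thread, NULL);
        pthread_create(&b, NULL, recovery_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}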