Hi Christian and Andrey,

The issue I encountered is that the bad job is being freed after we enter
amdgpu_device_gpu_recover. I don't know why; as Christian said,
drm_sched_cleanup_jobs should call cancel_delayed_work first.

Best wishes
Emily Deng

>-----Original Message-----
>From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Deng, Emily
>Sent: Monday, November 11, 2019 3:19 PM
>To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Koenig, Christian
><Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Hi Andrey,
>     I don't think your patch will help here. It may call kthread_should_park
>in drm_sched_cleanup_jobs first and kcl_kthread_park only afterwards, so
>there is still a race between the two threads.
>
>Best wishes
>Emily Deng
>
>>-----Original Message-----
>>From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>>Sent: Saturday, November 9, 2019 3:01 AM
>>To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Deng, Emily
>><Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>>On 11/8/19 5:35 AM, Koenig, Christian wrote:
>>> Hi Emily,
>>>
>>> exactly that can't happen. See here:
>>>
>>>>         /* Don't destroy jobs while the timeout worker is running */
>>>>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>             !cancel_delayed_work(&sched->work_tdr))
>>>>                 return NULL;
>>>
>>> We never free jobs while the timeout worker is running, to prevent
>>> exactly that issue.
>>
>>I don't think this protects us if drm_sched_cleanup_jobs is called for a
>>scheduler which didn't experience a timeout: in amdgpu_device_gpu_recover
>>we access sched->ring_mirror_list for all the schedulers on a device, so
>>the condition above won't protect us. What in fact could help is maybe my
>>recent patch 541c521 (drm/sched: Avoid job cleanup if sched thread is
>>parked), because we park each of the scheduler threads during tdr before
>>trying to access sched->ring_mirror_list.
>>
>>Emily - did you see this problem with that patch in place? I only pushed
>>it yesterday.
>>
>>Andrey
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>> Hi Christian,
>>>>      drm_sched_job_timedout -> amdgpu_job_timedout calls
>>>> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs
>>>> while we are in amdgpu_device_gpu_recover, before drm_sched_stop is
>>>> called.
>>>>
>>>> Best wishes
>>>> Emily Deng
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>
>>>>> Hi Emily,
>>>>>
>>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>
>>>>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>> Hi Christian,
>>>>>>      No, I am on the new branch and it also has the patch. Even if
>>>>>> jobs are freed by the main scheduler, how can we prevent the main
>>>>>> scheduler from freeing jobs while we are inside
>>>>>> amdgpu_device_gpu_recover?
>>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>
>>>>>>> Hi Emily,
>>>>>>>
>>>>>>> in this case you are on an old code branch.
>>>>>>>
>>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>>> timeout handler is running.
>>>>>>>
>>>>>>> See this patch here:
>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>> Author: Christian König <christian.koenig@xxxxxxx>
>>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>
>>>>>>>>     drm/scheduler: rework job destruction
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>> Hi Christian,
>>>>>>>>      Please refer to the following log. When it enters
>>>>>>>> amdgpu_device_gpu_recover, the bad job 000000005086879e is being
>>>>>>>> freed in amdgpu_job_free_cb at the same time, because the hardware
>>>>>>>> fence signaled. But amdgpu_device_gpu_recover runs faster; in this
>>>>>>>> case the s_fence is already freed while the job itself is not yet
>>>>>>>> freed. Then this issue occurs.
>>>>>>>>
>>>>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>>> [  449.800818] PGD 0 P4D 0
>>>>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE     4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>>> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>> [  449.811937] Call Trace:
>>>>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>>> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>> [  449.814077]  process_one_work+0x1fd/0x3f0
>>>>>>>> [  449.814417]  worker_thread+0x34/0x410
>>>>>>>> [  449.814728]  kthread+0x121/0x140
>>>>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>
>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>
>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>
>>>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>
>>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>
>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>> Ping.....
>>>>>>>>>>>
>>>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>>>
>>>>>>>>>>>> Best wishes
>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Deng, Emily
>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>>>> Then there will be a null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup.
>>>>>>>>>>>>> But in one case, when it enters amdgpu_device_gpu_recover it
>>>>>>>>>>>>> is already in drm_sched_job_cleanup, and at this time it will
>>>>>>>>>>>>> go on to free the job. But amdgpu_device_gpu_recover is
>>>>>>>>>>>>> sometimes faster. At that time the job is not yet freed, but
>>>>>>>>>>>>> s_fence is already NULL.
>>>>>>>>>>>
>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>
>>>>>>>>>>>>         drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>
>>>>>>>>>>>>         amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>         dma_fence_put(job->fence);
>>>>>>>>>>>>         amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>         amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>         kfree(job);
>>>>>>>>>>>
>>>>>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>>>>>>> to the s_fence. So you are just papering over a much bigger
>>>>>>>>>>> problem here. This patch is a clear NAK.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Christian.
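[Editorial aside: the window Emily describes can be modeled in a few lines of user-space C. The structs below are hypothetical simplifications, not the real driver types; the point is only that if a cleanup path clears s_fence before the job itself is freed, a concurrent reader holding the job pointer can, in that window, see a live job whose s_fence is already NULL.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real driver structures. */
struct sched_fence { int unused; };
struct sched_job   { struct sched_fence *s_fence; };

/* Models drm_sched_job_cleanup(): the s_fence reference is dropped and
 * the pointer cleared while the job itself is still allocated; the job
 * is only freed later (kfree(job) in the real free path). */
static void job_cleanup(struct sched_job *job)
{
	free(job->s_fence);
	job->s_fence = NULL;
}

/* Models the unguarded read in the recovery path: without a NULL check,
 * dereferencing job->s_fence->... inside this window would crash. */
static int fence_present(const struct sched_job *job)
{
	return job != NULL && job->s_fence != NULL;
}
```

Whether this window is actually reachable in the driver is exactly what the thread disputes; the sketch only illustrates the state Emily reports (job alive, s_fence NULL).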
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>  	 *
>>>>>>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>  		 *
>>>>>>>>>>>>>>>  		 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>>  		 */
>>>>>>>>>>>>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>> +		if (s_job->s_fence)
>>>>>>>>>>>>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>  		/*
>>>>>>>>>>>>>>>  		 * We must keep bad job alive for later use during
>>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
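[Editorial aside: the defensive pattern the patch above applies, checking s_fence before every dereference of s_fence->parent, can be sketched stand-alone. This is a user-space model with made-up struct names, not the kernel code; it only demonstrates the NULL-guarded pointer chain that the patched condition in amdgpu_device_gpu_recover uses.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct dma_fence   { int signaled; };
struct sched_fence { struct dma_fence *parent; };
struct sched_job   { struct sched_fence *s_fence; };

/* Mirrors the patched condition:
 *   if (job && job->base.s_fence && job->base.s_fence->parent &&
 *       dma_fence_is_signaled(job->base.s_fence->parent))
 * Each link in the pointer chain is checked before it is followed, so a
 * job whose s_fence was already cleared no longer causes a NULL deref. */
static int job_signaled(const struct sched_job *job)
{
	return job && job->s_fence && job->s_fence->parent
	       && job->s_fence->parent->signaled;
}
```

Note that Christian's objection stands regardless: the guard makes the read safe but does not explain how a live job can legitimately have a NULL s_fence in the first place.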