Hi Christian,
     Sorry, it seems I understood it wrong. And from the prints, the thread that frees the job is the same as the job timeout thread. So there seems to be some issue in the function amdgpu_device_gpu_recover.

Best wishes
Emily Deng

>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: Friday, November 8, 2019 6:35 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Hi Emily,
>
>exactly that can't happen. See here:
>
>>	/* Don't destroy jobs while the timeout worker is running */
>>	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>	    !cancel_delayed_work(&sched->work_tdr))
>>		return NULL;
>
>We never free jobs while the timeout worker is running to prevent exactly
>that issue.
>
>Regards,
>Christian.
>
>Am 08.11.19 um 11:32 schrieb Deng, Emily:
>> Hi Christian,
>>      The drm_sched_job_timedout -> amdgpu_job_timedout path calls
>> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs while
>> in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>>
>> Best wishes
>> Emily Deng
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Sent: Friday, November 8, 2019 6:26 PM
>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>> Hi Emily,
>>>
>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>
>>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>> Hi Christian,
>>>>      No, I am on the new branch and it also has the patch. Even if the jobs
>>>> are freed by the main scheduler, how could we avoid the main scheduler
>>>> freeing jobs while entering the function amdgpu_device_gpu_recover?
>>>> Best wishes
>>>> Emily Deng
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>
>>>>> Hi Emily,
>>>>>
>>>>> in this case you are on an old code branch.
>>>>>
>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>> timeout handler is running.
>>>>>
>>>>> See this patch here:
>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>> Author: Christian König <christian.koenig@xxxxxxx>
>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>
>>>>>>     drm/scheduler: rework job destruction
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>> Hi Christian,
>>>>>>      Please refer to the following log. When it enters the
>>>>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is
>>>>>> being freed in amdgpu_job_free_cb at the same time, because the
>>>>>> hardware fence signaled. But amdgpu_device_gpu_recover goes faster;
>>>>>> in this case the s_fence is already freed, but the job is not freed
>>>>>> in time. Then this issue occurs.
>>>>>>
>>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>> [  449.800818] PGD 0 P4D 0
>>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>> [  449.811937] Call Trace:
>>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>> [  449.814077]  process_one_work+0x1fd/0x3f0
>>>>>> [  449.814417]  worker_thread+0x34/0x410
>>>>>> [  449.814728]  kthread+0x121/0x140
>>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>
>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>> Sorry, please take your time.
>>>>>>> Have you seen my other response a bit below?
>>>>>>>
>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>> NULL without the job also being freed.
>>>>>>>
>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Emily Deng
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>
>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>> Ping.....
>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>;
>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>;
>>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>
>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>> Then it will have a null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>>>>> case, when it enters amdgpu_device_gpu_recover, it is already
>>>>>>>>>>> in drm_sched_job_cleanup, and at this time it will go on to
>>>>>>>>>>> free the job.
>>>>>>>>>>> But amdgpu_device_gpu_recover is sometimes faster. At that
>>>>>>>>>>> time, the job is not freed, but the s_fence is already NULL.
>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>
>>>>>>>>>>	drm_sched_job_cleanup(s_job);
>>>>>>>>>>
>>>>>>>>>>	amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>	dma_fence_put(job->fence);
>>>>>>>>>>	amdgpu_sync_free(&job->sync);
>>>>>>>>>>	amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>	kfree(job);
>>>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>>>>> to the s_fence.
>>>>>>>>>
>>>>>>>>> So you are just papering over a much bigger problem here. This
>>>>>>>>> patch is a clear NAK.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>  	 *
>>>>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>
>>>>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp,
>>>>>>>>>>>>>  					 &sched->ring_mirror_list, node) {
>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>  			 *
>>>>>>>>>>>>>  			 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>  			 */
>>>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>
>>>>>>>>>>>>>  			/*
>>>>>>>>>>>>>  			 * We must keep bad job alive for later use during
>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>
>>>>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx