Hi Emily,

well who is calling amdgpu_device_gpu_recover() in this case?

When it's not the scheduler we shouldn't have a guilty job in the first place.

Regards,
Christian.

On 08.11.19 at 11:22, Deng, Emily wrote:
> Hi Christian,
> No, I am on the new branch and it also has that patch. Even if the jobs are freed by the main scheduler, how can we avoid the main scheduler freeing jobs while we are inside amdgpu_device_gpu_recover?
>
> Best wishes
> Emily Deng
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>> Sent: Friday, November 8, 2019 6:15 PM
>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Emily,
>>
>> in this case you are on an old code branch.
>>
>> Jobs are freed now by the main scheduler thread and only if no timeout
>> handler is running.
>>
>> See this patch here:
>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>> Author: Christian König <christian.koenig@xxxxxxx>
>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>
>>>     drm/scheduler: rework job destruction
>> Regards,
>> Christian.
>>
>> On 08.11.19 at 11:11, Deng, Emily wrote:
>>> Hi Christian,
>>> Please refer to the following log. When it enters the
>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is
>>> being freed in amdgpu_job_free_cb at the same time, because the
>>> hardware fence signaled. But amdgpu_device_gpu_recover is faster, so
>>> at that point the s_fence is already freed while the job itself is
>>> not freed yet. Then this issue occurs.
>>>
>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>> [ 449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>> [ 449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>> [ 449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>> [ 449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>> [ 449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>> [ 449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>> [ 449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>> [ 449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>> [ 449.800818] PGD 0 P4D 0
>>> [ 449.801040] Oops: 0000 [#1] SMP PTI
>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>> [ 449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>> [ 449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>> [ 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>> [ 449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>> [ 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>> [ 449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>> [ 449.809004] FS: 0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>> [ 449.809674] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>> [ 449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 449.811937] Call Trace:
>>> [ 449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>> [ 449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>> [ 449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>> [ 449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>> [ 449.814077]  process_one_work+0x1fd/0x3f0
>>> [ 449.814417]  worker_thread+0x34/0x410
>>> [ 449.814728]  kthread+0x121/0x140
>>> [ 449.815004]  ? process_one_work+0x3f0/0x3f0
>>> [ 449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>> [ 449.815799]  ret_from_fork+0x35/0x40
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>>>> Sorry, please take your time.
>>>> Have you seen my other response a bit below?
>>>>
>>>> I can't follow how it would be possible for job->s_fence to be NULL
>>>> without the job also being freed.
>>>>
>>>> So it looks like this patch is just papering over some bigger issues.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>>>> Ping.....
>>>>>> You need to give me at least enough time to wake up :)
>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
>>>>>>>> Of Deng, Emily
>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>
>>>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>>>> When the job is already signaled, the s_fence is freed. Then
>>>>>>>>>> there will be a null pointer dereference in amdgpu_device_gpu_recover.
>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>> case, by the time we enter amdgpu_device_gpu_recover, the job is
>>>>>>>> already in drm_sched_job_cleanup and is about to be freed.
>>>>>>>> But amdgpu_device_gpu_recover is sometimes faster. At that
>>>>>>>> point the job is not freed yet, but s_fence is already NULL.
>>>>>> No, that case can't happen. See here:
>>>>>>
>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>
>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>> dma_fence_put(job->fence);
>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>> kfree(job);
>>>>>> The job itself is freed up directly after freeing the reference to the s_fence.
>>>>>>
>>>>>> So you are just papering over a much bigger problem here. This
>>>>>> patch is a clear NAK.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>> problem is somewhere else.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
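To make the lifetime argument concrete, here is the free path Christian quotes above with annotations added. The call sequence is copied from his snippet (s_job, job and ring come from the surrounding driver function); only the comments are new, and they mark where each side of the disagreement sits:

	drm_sched_job_cleanup(s_job);	/* s_job->s_fence becomes NULL */

	/* The window Emily describes opens here: the job is still
	 * allocated but s_fence is already NULL, so a concurrent
	 * amdgpu_device_gpu_recover() dereferencing
	 * job->base.s_fence->parent faults at offset 0xc0 from NULL,
	 * matching CR2 = 00000000000000c0 in the log above. */

	amdgpu_ring_priority_put(ring, s_job->s_priority);
	dma_fence_put(job->fence);
	amdgpu_sync_free(&job->sync);
	amdgpu_sync_free(&job->sched_sync);
	kfree(job);			/* the job dies immediately after */

	/* Past this point even the NULL checks added by the patch read
	 * freed memory; the underlying bug would be a use-after-free,
	 * which is why Christian calls the checks papering over a much
	 * bigger problem. */

Whether the recovery path can actually observe that window is exactly what the two sides are arguing about.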
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>> ---
>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>   	 *
>>>>>>>>>>   	 * job->base holds a reference to parent fence
>>>>>>>>>>   	 */
>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>   		job_signaled = true;
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>
>>>>>>>>>>   		spin_lock(&rq->lock);
>>>>>>>>>>   		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>> -			if (bad->s_fence->scheduled.context ==
>>>>>>>>>> -			    entity->fence_context) {
>>>>>>>>>> +			if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>> +			    entity->fence_context)) {
>>>>>>>>>>   				if (atomic_read(&bad->karma) >
>>>>>>>>>>   				    bad->sched->hang_limit)
>>>>>>>>>>   					if (entity->guilty)
>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>   	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>   	 */
>>>>>>>>>>   	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>   					      &s_job->cb)) {
>>>>>>>>>>   			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>   		 *
>>>>>>>>>>   		 * Job is still alive so fence refcount at least 1
>>>>>>>>>>   		 */
>>>>>>>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>> +		if (s_job->s_fence)
>>>>>>>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>   		/*
>>>>>>>>>>   		 * We must keep bad job alive for later use during
>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>   	 * GPU recovers can't run in parallel.
>>>>>>>>>>   	 */
>>>>>>>>>>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>
>>>>>>>>>>   		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
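The rule Christian cites from commit 5918045c ("drm/scheduler: rework job destruction") is that jobs are freed only by the scheduler main thread, and only while no timeout handler is running. Below is a compilable userspace toy model of that handshake. All toy_* names are invented for illustration; this is not the driver or scheduler code, just a sketch of the synchronization idea:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy model of "jobs are freed only while no timeout handler runs".
 * A job and its fence are always torn down together, and teardown is
 * skipped while a timeout handler is active, so the handler can never
 * observe a job whose s_fence is already NULL. */

struct toy_fence { int signaled; };

struct toy_job {
	struct toy_fence *s_fence;	/* dies together with the job */
};

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
static bool timeout_running;
static struct toy_job *finished;	/* one finished job, for brevity */

/* Scheduler-thread side: the only place a job may be destroyed. */
static void toy_free_finished(void)
{
	pthread_mutex_lock(&sched_lock);
	if (!timeout_running && finished) {
		free(finished->s_fence);	/* fence and job are freed */
		free(finished);			/* together, never halfway */
		finished = NULL;
	}
	pthread_mutex_unlock(&sched_lock);
}

/* Timeout-handler side: freeing is held off while recovery runs. */
static void toy_timedout(void)
{
	struct toy_job *bad;

	pthread_mutex_lock(&sched_lock);
	timeout_running = true;		/* blocks toy_free_finished() */
	bad = finished;			/* picked up under the lock */
	pthread_mutex_unlock(&sched_lock);

	/* Slow recovery work runs unlocked, but the flag keeps the job,
	 * and therefore bad->s_fence, alive for this whole window. */
	if (bad)
		printf("recover: fence signaled=%d\n", bad->s_fence->signaled);

	pthread_mutex_lock(&sched_lock);
	timeout_running = false;	/* freeing may resume */
	pthread_mutex_unlock(&sched_lock);
}

int main(void)
{
	finished = malloc(sizeof(*finished));
	finished->s_fence = malloc(sizeof(*finished->s_fence));
	finished->s_fence->signaled = 1;

	toy_timedout();		/* job guaranteed alive during recovery */
	toy_free_finished();	/* teardown happens only afterwards */
	return 0;
}

Under such a rule the NULL checks in the patch above are unnecessary, which is the point of Christian's remark that a branch showing this crash simply predates the rework.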