RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

"Deng, Emily" <Emily.Deng@xxxxxxx> · Fri, 8 Nov 2019 10:32:36 +0000



Hi Christian,
     The drm_sched_job_timedout-> amdgpu_job_timedout call amdgpu_device_gpu_recover. I mean the main scheduler free the jobs while in amdgpu_device_gpu_recover, and before calling drm_sched_stop. 

Best wishes
Emily Deng


>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: Friday, November 8, 2019 6:26 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Hi Emily,
>
>well who is calling amdgpu_device_gpu_recover() in this case?
>
>When it's not the scheduler we shouldn't have a guilty job in the first place.
>
>Regards,
>Christian.
>
>Am 08.11.19 um 11:22 schrieb Deng, Emily:
>> Hi Chrisitan,
>>       No, I am with the new branch and also has the patch. Even it are freed by
>main scheduler, how we could avoid main scheduler to free jobs while enter
>to function amdgpu_device_gpu_recover?
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Sent: Friday, November 8, 2019 6:15 PM
>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>> Hi Emily,
>>>
>>> in this case you are on an old code branch.
>>>
>>> Jobs are freed now by the main scheduler thread and only if no
>>> timeout handler is running.
>>>
>>> See this patch here:
>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>> Author: Christian König <christian.koenig@xxxxxxx>
>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>
>>>>      drm/scheduler: rework job destruction
>>> Regards,
>>> Christian.
>>>
>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>> Hi Christian,
>>>>        Please refer to follow log, when it enter to
>>>> amdgpu_device_gpu_recover
>>> function, the bad job 000000005086879e is freeing in function
>>> amdgpu_job_free_cb  at the same time, because of the hardware fence
>signal.
>>> But amdgpu_device_gpu_recover goes faster, at this case, the s_fence
>>> is already freed, but job is not freed in time. Then this issue occurs.
>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>sdma0
>>>> timeout, signaled seq=2481, emitted seq=2483 [  449.793202]
>>>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
>>> process  pid 0 thread  pid 0, s_job:000000005086879e [  449.794163]
>>> amdgpu
>>> 0000:00:08.0: GPU reset begin!
>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process
>>>> pid 0 thread  pid 0, s_job:000000005086879e [  449.794221]
>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>> pid 0, s_job:0000000066eb74ab [  449.794222]
>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>> pid 0, s_job:00000000d4438ad9 [  449.794255]
>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>> pid 0, s_job:00000000b6d69c65 [  449.794257]
>>>> Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread
>>>> pid 0,
>>> s_job:00000000ea85e922 [  449.794287]
>>> Emily:amdgpu_job_free_cb,Process
>>> information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6 [
>>> 449.794366] BUG: unable to handle kernel NULL pointer dereference at
>>> 00000000000000c0 [  449.800818] PGD 0 P4D 0 [  449.801040] Oops: 0000
>>> [#1] SMP PTI
>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE
>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [  449.802944]
>>>> Workqueue: events drm_sched_job_timedout [amd_sched] [  449.803488]
>RIP:
>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff
>>>> 45 85 e4 0f
>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b
>98
>>> c0 00         00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [
>>>> 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>>> 0000000000000000 [  449.806625] RDX: ffffb4c7c08f5ac0 RSI:
>>>> 0000000fffffffe0 RDI: 0000000000000246 [  449.807224] RBP:
>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000 [
>>>> 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12:
>>>> 0000000000000000 [  449.808411] R13: ffffb4c7c08f7da0 R14:
>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [  449.809004] FS:
>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>> knlGS:0000000000000000 [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0:
>>>> 0000000080050033 [  449.810153] CR2: 00000000000000c0 CR3:
>>>> 000000003cc0a001 CR4: 00000000003606e0 [  449.810747] DR0:
>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [
>>> 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>> 0000000000000400 [  449.811937] Call Trace:
>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu] [
>>>> 449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>> 449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [
>>>> 449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>> 449.814077]  process_one_work+0x1fd/0x3f0 [  449.814417]
>>>> worker_thread+0x34/0x410 [  449.814728]  kthread+0x121/0x140 [
>>>> 449.815004]  ? process_one_work+0x3f0/0x3f0 [  449.815374]  ?
>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-
>gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>
>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>> Sorry, please take your time.
>>>>> Have you seen my other response a bit below?
>>>>>
>>>>> I can't follow how it would be possible for job->s_fence to be NULL
>>>>> without the job also being freed.
>>>>>
>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-
>>> gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>> tdr
>>>>>>>
>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>> Ping.....
>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Emily Deng
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On
>Behalf
>>>>>>>>> Of Deng, Emily
>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-
>>>>>>>>> gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>> tdr
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>;
>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>> for tdr
>>>>>>>>>>
>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>> When the job is already signaled, the s_fence is freed. Then
>>>>>>>>>>> it will has null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>>> case, when it enter into the amdgpu_device_gpu_recover, it
>>>>>>>>> already in drm_sched_job_cleanup, and at this time, it will go
>>>>>>>>> to free
>>> job.
>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is faster. At that
>>>>>>>>> time, job is not freed, but s_fence is already NULL.
>>>>>>> No, that case can't happen. See here:
>>>>>>>
>>>>>>>>            drm_sched_job_cleanup(s_job);
>>>>>>>>
>>>>>>>>            amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>            dma_fence_put(job->fence);
>>>>>>>>            amdgpu_sync_free(&job->sync);
>>>>>>>>            amdgpu_sync_free(&job->sched_sync);
>>>>>>>>            kfree(job);
>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>>> to the
>>> s_fence.
>>>>>>> So you are just papering over a much bigger problem here. This
>>>>>>> patch is a clear NAK.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>       drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>       2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int
>amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>       	 *
>>>>>>>>>>>       	 * job->base holds a reference to parent fence
>>>>>>>>>>>       	 */
>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent
>>>>> &&
>>>>>>>>>>>       	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>       		job_signaled = true;
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct
>>>>>>>>>> drm_sched_job
>>>>>>>>>>> *bad)
>>>>>>>>>>>
>>>>>>>>>>>       			spin_lock(&rq->lock);
>>>>>>>>>>>       			list_for_each_entry_safe(entity, tmp,
>&rq-
>>>>>> entities,
>>>>>>>>>> list) {
>>>>>>>>>>> -				if (bad->s_fence->scheduled.context
>>>>> ==
>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence-
>>>>>>>>>>> scheduled.context ==
>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>       					if (atomic_read(&bad-
>>>>>> karma) >
>>>>>>>>>>>       					    bad->sched-
>>hang_limit)
>>>>>>>>>>>       						if (entity-
>>guilty) @@ -376,7 +376,7 @@ void
>>>>>>>>>>> drm_sched_stop(struct
>>>>> drm_gpu_scheduler
>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>       	 * This iteration is thread safe as sched thread is
>stopped.
>>>>>>>>>>>       	 */
>>>>>>>>>>>       	list_for_each_entry_safe_reverse(s_job, tmp, &sched-
>>>>>>>>>>> ring_mirror_list, node) {
>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>       		    dma_fence_remove_callback(s_job-
>>s_fence-
>>>>>> parent,
>>>>>>>>>>>       					      &s_job->cb)) {
>>>>>>>>>>>       			atomic_dec(&sched->hw_rq_count);
>@@ -
>>>>> 395,7
>>>>>>>>> +395,8 @@ void
>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler
>>>>>>>>>> *sched, struct drm_sched_job *bad)
>>>>>>>>>>>       			 *
>>>>>>>>>>>       			 * Job is still alive so fence refcount at
>least 1
>>>>>>>>>>>       			 */
>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished,
>>>>> false);
>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence-
>>>>>> finished,
>>>>>>>>>> false);
>>>>>>>>>>>       			/*
>>>>>>>>>>>       			 * We must keep bad job alive for later
>use
>>>>> during @@
>>>>>>>>>> -438,7
>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler
>>> *sched,
>>>>>>>>>>> +bool
>>>>>>>>>> full_recovery)
>>>>>>>>>>>       	 * GPU recovers can't run in parallel.
>>>>>>>>>>>       	 */
>>>>>>>>>>>       	list_for_each_entry_safe(s_job, tmp,
>>>>>>>>>>> &sched->ring_mirror_list,
>>>>>>>>>>> node)
>>>>>>>>>> {
>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job-
>>>>>> s_fence-
>>>>>>>>>>> parent :
>>>>>>>>>>> +NULL;
>>>>>>>>>>>
>>>>>>>>>>>       		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx