Hi Andrey,
     I don't think your patch will help here. It may call kthread_should_park() in drm_sched_cleanup_jobs first and only call kcl_kthread_park() afterwards, so there is still a race between the two threads (a toy model of this interleaving is sketched at the end of this mail).

Best wishes
Emily Deng

>-----Original Message-----
>From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>Sent: Saturday, November 9, 2019 3:01 AM
>To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Deng, Emily
><Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>
>On 11/8/19 5:35 AM, Koenig, Christian wrote:
>> Hi Emily,
>>
>> exactly that can't happen. See here:
>>
>>>         /* Don't destroy jobs while the timeout worker is running */
>>>         if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>             !cancel_delayed_work(&sched->work_tdr))
>>>                 return NULL;
>> We never free jobs while the timeout worker is running, to prevent
>> exactly that issue.
>
>
>I don't think this protects us if drm_sched_cleanup_jobs is called for a
>scheduler which didn't experience a timeout: in amdgpu_device_gpu_recover
>we access sched->ring_mirror_list for all the schedulers on a device, so the
>condition above won't protect us. What in fact could help is my recent patch
>541c521 ("drm/sched: Avoid job cleanup if sched thread is parked"), because
>we do park each of the scheduler threads during tdr before trying to access
>sched->ring_mirror_list.
>
>Emily - did you see this problem with that patch in place? I only pushed it
>yesterday.
>
>Andrey
>
>
>>
>> Regards,
>> Christian.
>>
>> On 08.11.19 11:32, Deng, Emily wrote:
>>> Hi Christian,
>>>      drm_sched_job_timedout -> amdgpu_job_timedout calls
>>> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs
>>> while we are in amdgpu_device_gpu_recover, before drm_sched_stop is
>>> called.
>>>
>>> Best wishes
>>> Emily Deng
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> Hi Emily,
>>>>
>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>
>>>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 08.11.19 11:22, Deng, Emily wrote:
>>>>> Hi Christian,
>>>>>      No, I am on the new branch and it also has the patch. Even if
>>>>> jobs are freed by the main scheduler, how can we avoid the main
>>>>> scheduler freeing jobs while we enter amdgpu_device_gpu_recover?
>>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> Hi Emily,
>>>>>>
>>>>>> in this case you are on an old code branch.
>>>>>>
>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>> timeout handler is running.
>>>>>>
>>>>>> See this patch here:
>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>> Author: Christian König <christian.koenig@xxxxxxx>
>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>
>>>>>>>     drm/scheduler: rework job destruction
>>>>>> Regards,
>>>>>> Christian.
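A note on the scope of the guard quoted at the top of Andrey's mail: it only serializes job cleanup against that one scheduler's own timeout worker, which is exactly the gap Andrey points out. Below is a compilable paraphrase of that check with all kernel types and helpers stubbed out; only the control flow is taken from the snippet above, and first_finished_job() is an invented placeholder for the rest of the cleanup function.

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SCHEDULE_TIMEOUT LONG_MAX

struct delayed_work { int pending; };          /* stub */
struct drm_sched_job;                          /* opaque here */
struct drm_gpu_scheduler {
        long timeout;
        struct delayed_work work_tdr;          /* runs the timeout handler */
};

/* stub: true if the timeout worker was idle and has now been cancelled */
static bool cancel_delayed_work(struct delayed_work *work)
{
        (void)work;
        return true;
}

/* invented placeholder for "pop the next finished job off ring_mirror_list" */
static struct drm_sched_job *first_finished_job(struct drm_gpu_scheduler *sched)
{
        (void)sched;
        return NULL;
}

static struct drm_sched_job *
cleanup_guard_sketch(struct drm_gpu_scheduler *sched)
{
        /* Don't destroy jobs while *this* scheduler's timeout worker runs. */
        if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
            !cancel_delayed_work(&sched->work_tdr))
                return NULL;

        /*
         * Note the scope: only this scheduler's work_tdr is checked. A
         * timeout handler running for ring A that walks ring B's
         * ring_mirror_list is not held off by ring B's guard.
         */
        return first_finished_job(sched);
}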
>>>>>>
>>>>>> On 08.11.19 11:11, Deng, Emily wrote:
>>>>>>> Hi Christian,
>>>>>>>      Please refer to the following log. When it enters the
>>>>>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is
>>>>>>> being freed in amdgpu_job_free_cb at the same time, because the
>>>>>>> hardware fence signaled. But amdgpu_device_gpu_recover goes faster;
>>>>>>> in this case the s_fence is already freed, but the job is not freed
>>>>>>> in time, and then this issue occurs.
>>>>>>>
>>>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>> [  449.800818] PGD 0 P4D 0
>>>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE     4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>> [  449.811937] Call Trace:
>>>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>> [  449.814077]  process_one_work+0x1fd/0x3f0
>>>>>>> [  449.814417]  worker_thread+0x34/0x410
>>>>>>> [  449.814728]  kthread+0x121/0x140
>>>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>> [  449.815799]  ret_from_fork+0x35/0x40
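The oops above is the observable half of the race Emily describes. As an illustration only, here is a small user-space model of that interleaving: one thread plays the free callback (s_fence torn down first, the job freed a moment later), the other plays amdgpu_device_gpu_recover() dereferencing job->s_fence without a NULL check. The structures, names, and timings here are invented for the model; only the two-step teardown and the unchecked dereference come from the discussion in this thread. Built with cc -pthread race.c, the second thread tends to fault inside the window, analogous to the NULL dereference in the log above.

/* Toy user-space model of the race described above -- not kernel code.
 * Thread A models amdgpu_job_free_cb() (s_fence cleared, job freed later);
 * thread B models amdgpu_device_gpu_recover() touching job->s_fence.
 * Crashing here is the "success" case: it demonstrates the window.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct s_fence { void *parent; };
struct job     { struct s_fence *s_fence; };

static struct job *bad_job;

static void *free_cb(void *arg)          /* models amdgpu_job_free_cb */
{
        (void)arg;
        free(bad_job->s_fence);
        bad_job->s_fence = NULL;         /* models drm_sched_job_cleanup() */
        usleep(1000);                    /* window: job alive, s_fence gone */
        free(bad_job);                   /* models kfree(job) */
        return NULL;
}

static void *gpu_recover(void *arg)      /* models amdgpu_device_gpu_recover */
{
        (void)arg;
        /* unchecked dereference, as in the unpatched code */
        printf("parent = %p\n", bad_job->s_fence->parent);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        bad_job = calloc(1, sizeof(*bad_job));
        bad_job->s_fence = calloc(1, sizeof(*bad_job->s_fence));
        pthread_create(&a, NULL, free_cb, NULL);
        pthread_create(&b, NULL, gpu_recover, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}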
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>
>>>>>>>> On 08.11.19 10:39, Deng, Emily wrote:
>>>>>>>>> Sorry, please take your time.
>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>
>>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>> NULL without the job also being freed.
>>>>>>>>
>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Emily Deng
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>
>>>>>>>>>> On 08.11.19 09:52, Deng, Emily wrote:
>>>>>>>>>>> Ping.....
>>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07.11.19 11:25, Emily Deng wrote:
>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed. Then
>>>>>>>>>>>>>> there will be a null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>>>>>> case, when it enters amdgpu_device_gpu_recover, it is already in
>>>>>>>>>>>> drm_sched_job_cleanup, and at this time it will go on to free
>>>>>>>>>>>> the job. But amdgpu_device_gpu_recover is sometimes faster. At
>>>>>>>>>>>> that time the job is not freed, but s_fence is already NULL.
>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>          drm_sched_job_cleanup(s_job);
>>>>>>>>>>>
>>>>>>>>>>>          amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>          dma_fence_put(job->fence);
>>>>>>>>>>>          amdgpu_sync_free(&job->sync);
>>>>>>>>>>>          amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>          kfree(job);
>>>>>>>>>> The job itself is freed up directly after freeing the reference
>>>>>>>>>> to the s_fence. So you are just papering over a much bigger
>>>>>>>>>> problem here. This patch is a clear NAK.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
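Both sides above are arguing about the same few lines: drm_sched_job_cleanup() clears s_fence before kfree(job) runs, so there is a short span in which the job allocation is still valid while s_fence is already NULL. A compilable, annotated sketch of the ordering in the snippet Christian quotes, with the kernel types and helpers stubbed out; only the sequence is the point.

/* Annotated sketch of the free path quoted above; stubs, not kernel code. */
#include <stdlib.h>

struct drm_sched_fence { int dummy; };
struct drm_sched_job   { struct drm_sched_fence *s_fence; };
struct amdgpu_job      { struct drm_sched_job base; void *fence; };

static void drm_sched_job_cleanup(struct drm_sched_job *job)
{
        job->s_fence = NULL;    /* the real function also drops the ref */
}

static void dma_fence_put(void *fence) { (void)fence; }

static void amdgpu_job_free_cb_sketch(struct amdgpu_job *job)
{
        drm_sched_job_cleanup(&job->base);
        /*
         * The span Emily points at: the job is still allocated here but
         * s_fence is already NULL. Christian's counter-argument is that
         * kfree(job) follows immediately, so nothing should be able to
         * observe the job in this state.
         */
        dma_fence_put(job->fence);
        free(job);              /* stands in for kfree(job) */
}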
>>>>>>>>>>
>>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>   	 *
>>>>>>>>>>>>>>   	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>>   	 */
>>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>   		job_signaled = true;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   			spin_lock(&rq->lock);
>>>>>>>>>>>>>>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>>   					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>   					    bad->sched->hang_limit)
>>>>>>>>>>>>>>   						if (entity->guilty)
>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>   	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>   	 */
>>>>>>>>>>>>>>   	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>   					      &s_job->cb)) {
>>>>>>>>>>>>>>   			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>   		 *
>>>>>>>>>>>>>>   		 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>   		 */
>>>>>>>>>>>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>> +		if (s_job->s_fence)
>>>>>>>>>>>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   		/*
>>>>>>>>>>>>>>   		 * We must keep bad job alive for later use during
>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>>   	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>   	 */
>>>>>>>>>>>>>>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
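As referenced at the top of this mail, here is a toy user-space model of the check-then-park window Emily describes: thread A plays drm_sched_cleanup_jobs sampling kthread_should_park() before freeing, and thread B plays the recovery path that parks the scheduler thread only after A has already sampled. All names, flags, and sleeps are invented for the model; only the ordering is the point. Whether the real kthread_park() path actually leaves this window open is exactly what is being debated above; the model only shows why a check-then-act sequence is not sufficient on its own.

/* Toy model of the check-then-park race -- not kernel code.
 * Build: cc -pthread -o park park.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool parked;     /* models kthread_park() having taken effect */
static atomic_int  job_freed;  /* models the job leaving ring_mirror_list   */

static void *cleanup_thread(void *arg)  /* models drm_sched_cleanup_jobs */
{
        (void)arg;
        if (!atomic_load(&parked)) {    /* kthread_should_park() sampled first */
                usleep(1000);           /* recovery parks us in this gap...    */
                atomic_store(&job_freed, 1); /* ...but we free the job anyway  */
        }
        return NULL;
}

static void *recovery_thread(void *arg) /* models the tdr/recovery path */
{
        (void)arg;
        usleep(500);
        atomic_store(&parked, 1);       /* models kcl_kthread_park() */
        usleep(1000);                   /* then start walking ring_mirror_list */
        /* despite parking first, the cleanup pass already in flight ran: */
        printf("job freed while recovering: %d\n", atomic_load(&job_freed));
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, cleanup_thread, NULL);
        pthread_create(&b, NULL, recovery_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}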