Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Hi Emily,

In this case you are on an old code branch.

Jobs are now freed by the main scheduler thread, and only when no timeout
handler is running.

See this patch here:
> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
> Author: Christian König <christian.koenig@xxxxxxx>
> Date:   Thu Apr 18 11:00:21 2019 -0400
>
>     drm/scheduler: rework job destruction
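
As a rough illustration of that rule, here is a single-threaded userspace sketch. It is not the scheduler code from the commit above: every toy_* name is made up for this example, and the timeout_running flag only stands in for "a timeout handler is currently running". It shows that finished jobs stay alive, with a valid s_fence, for as long as the handler is active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins, NOT the kernel's drm_sched structures. */
struct toy_fence {
	bool signaled;
};

struct toy_job {
	struct toy_fence *s_fence;
	struct toy_job *next;
};

struct toy_sched {
	struct toy_job *pending;	/* oldest job first */
	bool timeout_running;		/* true while the timeout handler runs */
};

/* Append a job to the pending list; returns NULL on allocation failure. */
static struct toy_job *toy_job_push(struct toy_sched *s, bool signaled)
{
	struct toy_job **pos = &s->pending;
	struct toy_job *job = calloc(1, sizeof(*job));

	if (!job)
		return NULL;
	job->s_fence = calloc(1, sizeof(*job->s_fence));
	if (!job->s_fence) {
		free(job);
		return NULL;
	}
	job->s_fence->signaled = signaled;

	while (*pos)
		pos = &(*pos)->next;
	*pos = job;
	return job;
}

/*
 * Models the cleanup step of the main scheduler thread: finished jobs are
 * destroyed here and nowhere else, and never while the timeout handler is
 * running. Returns the number of jobs freed.
 */
static int toy_sched_cleanup(struct toy_sched *s)
{
	int freed = 0;

	if (s->timeout_running)
		return 0;

	while (s->pending && s->pending->s_fence->signaled) {
		struct toy_job *job = s->pending;

		s->pending = job->next;
		/* The fence and the job go away together, in one place. */
		free(job->s_fence);
		free(job);
		freed++;
	}
	return freed;
}
```

With the flag set, toy_sched_cleanup() frees nothing, so a recovery path that inspects job->s_fence never sees a half-destroyed job. Real code needs proper synchronization instead of a plain flag; the sketch only models the ordering.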

Regards,
Christian.

Am 08.11.19 um 11:11 schrieb Deng, Emily:
> Hi Christian,
>       Please refer to the following log. When execution enters amdgpu_device_gpu_recover, the bad job 000000005086879e is being freed in amdgpu_job_free_cb at the same time, because the hardware fence has signaled. But amdgpu_device_gpu_recover runs faster: in this case the s_fence has already been freed, but the job itself has not been freed yet. That is when this issue occurs.
>
> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:0000000066eb74ab
> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000d4438ad9
> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000b6d69c65
> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ea85e922
> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6
> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
> [  449.800818] PGD 0 P4D 0
> [  449.801040] Oops: 0000 [#1] SMP PTI
> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G           OE     4.18.0-15-generic #16~18.04.1-Ubuntu
> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
> [  449.809004] FS:  0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  449.811937] Call Trace:
> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
> [  449.814077]  process_one_work+0x1fd/0x3f0
> [  449.814417]  worker_thread+0x34/0x410
> [  449.814728]  kthread+0x121/0x140
> [  449.815004]  ? process_one_work+0x3f0/0x3f0
> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  449.815799]  ret_from_fork+0x35/0x40
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>> Sent: Friday, November 8, 2019 5:43 PM
>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>> Sorry, please take your time.
>> Have you seen my other response a bit below?
>>
>> I can't follow how it would be possible for job->s_fence to be NULL without
>> the job also being freed.
>>
>> So it looks like this patch is just papering over some bigger issues.
>>
>> Regards,
>> Christian.
>>
>>> Best wishes
>>> Emily Deng
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>> Ping.....
>>>> You need to give me at least enough time to wake up :)
>>>>
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Deng, Emily
>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>
>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>> When the job is already signaled, the s_fence is freed. This then
>>>>>>>> leads to a NULL pointer dereference in amdgpu_device_gpu_recover.
>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>> See drm_sched_job_cleanup().
>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one case,
>>>>>> when execution enters amdgpu_device_gpu_recover, the job is already
>>>>>> in drm_sched_job_cleanup and is about to be freed. But
>>>>>> amdgpu_device_gpu_recover is sometimes faster. At that time, the
>>>>>> job has not been freed yet, but s_fence is already NULL.
>>>> No, that case can't happen. See here:
>>>>
>>>>>           drm_sched_job_cleanup(s_job);
>>>>>
>>>>>           amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>           dma_fence_put(job->fence);
>>>>>           amdgpu_sync_free(&job->sync);
>>>>>           amdgpu_sync_free(&job->sched_sync);
>>>>>           kfree(job);
>>>> The job itself is freed up directly after freeing the reference to the s_fence.
>>>>
>>>> So you are just papering over a much bigger problem here. This patch
>>>> is a clear NAK.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>>> When you see a job without an s_fence then that means the problem
>>>>>>> is somewhere else.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>      2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>      	 *
>>>>>>>>      	 * job->base holds a reference to parent fence
>>>>>>>>      	 */
>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>      	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>      		job_signaled = true;
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>
>>>>>>>>      			spin_lock(&rq->lock);
>>>>>>>>      			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>> -				    entity->fence_context) {
>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>      					if (atomic_read(&bad->karma) >
>>>>>>>>      					    bad->sched->hang_limit)
>>>>>>>>      						if (entity->guilty)
>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>      	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>      	 */
>>>>>>>>      	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>      		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>      					      &s_job->cb)) {
>>>>>>>>      			atomic_dec(&sched->hw_rq_count);
>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>      			 *
>>>>>>>>      			 * Job is still alive so fence refcount at least 1
>>>>>>>>      			 */
>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>> +			if (s_job->s_fence)
>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>      			/*
>>>>>>>>      			 * We must keep bad job alive for later use during
>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>      	 * GPU recovers can't run in parallel.
>>>>>>>>      	 */
>>>>>>>>      	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>
>>>>>>>>      		atomic_inc(&sched->hw_rq_count);
>>>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
