On 08.11.19 09:52, Deng, Emily wrote:
> Ping.....

You need to give me at least enough time to wake up :)

>
> Best wishes
> Emily Deng
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Deng, Emily
>> Sent: Friday, November 8, 2019 10:56 AM
>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>> Sent: Thursday, November 7, 2019 7:28 PM
>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>> On 07.11.19 11:25, Emily Deng wrote:
>>>> When the job is already signaled, the s_fence is freed. This then
>>>> leads to a null pointer dereference in amdgpu_device_gpu_recover.
>>> NAK, the s_fence is only set to NULL when the job is destroyed. See
>>> drm_sched_job_cleanup().
>> I know it is set to NULL in drm_sched_job_cleanup(). But in one case, by the
>> time it enters amdgpu_device_gpu_recover, it is already in
>> drm_sched_job_cleanup(), and will then go on to free the job. But
>> amdgpu_device_gpu_recover is sometimes faster: at that point the job is not
>> yet freed, but s_fence is already NULL.

No, that case can't happen. See here:

>         drm_sched_job_cleanup(s_job);
>
>         amdgpu_ring_priority_put(ring, s_job->s_priority);
>         dma_fence_put(job->fence);
>         amdgpu_sync_free(&job->sync);
>         amdgpu_sync_free(&job->sched_sync);
>         kfree(job);

The job itself is freed up directly after freeing the reference to the s_fence.

So you are just papering over a much bigger problem here. This patch is a clear NAK.

Regards,
Christian.

>>> When you see a job without an s_fence then that means the problem is
>>> somewhere else.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index e6ce949..5a8f08e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>  	 *
>>>>  	 * job->base holds a reference to parent fence
>>>>  	 */
>>>> -	if (job && job->base.s_fence->parent &&
>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>  		job_signaled = true;
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 31809ca..56cc10e 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>
>>>>  			spin_lock(&rq->lock);
>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>> -				if (bad->s_fence->scheduled.context ==
>>>> -				    entity->fence_context) {
>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>> +				    entity->fence_context)) {
>>>>  					if (atomic_read(&bad->karma) >
>>>>  					    bad->sched->hang_limit)
>>>>  						if (entity->guilty)
>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>  	 */
>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>> -		if (s_job->s_fence->parent &&
>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>  					      &s_job->cb)) {
>>>>  			atomic_dec(&sched->hw_rq_count);
>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>  		 *
>>>>  		 * Job is still alive so fence refcount at least 1
>>>>  		 */
>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>> +		if (s_job->s_fence)
>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>
>>>>  		/*
>>>>  		 * We must keep bad job alive for later use during
>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>  	 * GPU recovers can't run in parallel.
>>>>  	 */
>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>
>>>>  		atomic_inc(&sched->hw_rq_count);
>>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx