On 08.11.19 10:39, Deng, Emily wrote:
> Sorry, please take your time.

Have you seen my other response a bit below? I can't follow how it would
be possible for job->s_fence to be NULL without the job also being
freed. So it looks like this patch is just papering over some bigger
issues.

Regards,
Christian.

>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>> Sent: Friday, November 8, 2019 5:08 PM
>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> On 08.11.19 09:52, Deng, Emily wrote:
>>> Ping.....
>> You need to give me at least enough time to wake up :)
>>
>>>
>>> Best wishes
>>> Emily Deng
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>>> Deng, Emily
>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>
>>>>> On 07.11.19 11:25, Emily Deng wrote:
>>>>>> When the job is already signaled, the s_fence is freed. Then it
>>>>>> will hit a null pointer in amdgpu_device_gpu_recover.
>>>>> NAK, the s_fence is only set to NULL when the job is destroyed. See
>>>>> drm_sched_job_cleanup().
>>>> I know it is set to NULL in drm_sched_job_cleanup(). But in one case,
>>>> when it enters amdgpu_device_gpu_recover, it is already in
>>>> drm_sched_job_cleanup(), and at that point it will go on to free the
>>>> job. But amdgpu_device_gpu_recover is sometimes faster.
>>>> At that time, job is not freed, but s_fence is already NULL.
>> No, that case can't happen. See here:
>>
>>> drm_sched_job_cleanup(s_job);
>>>
>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>> dma_fence_put(job->fence);
>>> amdgpu_sync_free(&job->sync);
>>> amdgpu_sync_free(&job->sched_sync);
>>> kfree(job);
>> The job itself is freed up directly after freeing the reference to the s_fence.
>>
>> So you are just papering over a much bigger problem here. This patch is a
>> clear NAK.
>>
>> Regards,
>> Christian.
>>
>>>>> When you see a job without an s_fence then that means the problem is
>>>>> somewhere else.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index e6ce949..5a8f08e 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>   	 *
>>>>>>   	 * job->base holds a reference to parent fence
>>>>>>   	 */
>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>   		job_signaled = true;
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 31809ca..56cc10e 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>
>>>>>>   			spin_lock(&rq->lock);
>>>>>>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>> -				    entity->fence_context) {
>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>> +				    entity->fence_context)) {
>>>>>>   					if (atomic_read(&bad->karma) >
>>>>>>   					    bad->sched->hang_limit)
>>>>>>   						if (entity->guilty)
>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>   	 * This iteration is thread safe as sched thread is stopped.
>>>>>>   	 */
>>>>>>   	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>> -		if (s_job->s_fence->parent &&
>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>   					      &s_job->cb)) {
>>>>>>   			atomic_dec(&sched->hw_rq_count);
>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>   		 *
>>>>>>   		 * Job is still alive so fence refcount at least 1
>>>>>>   		 */
>>>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>> +		if (s_job->s_fence)
>>>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>
>>>>>>   		/*
>>>>>>   		 * We must keep bad job alive for later use during
>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>   	 * GPU recovers can't run in parallel.
>>>>>>   	 */
>>>>>>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>> +		struct dma_fence *fence = s_job->s_fence ?
>>>>>> +			s_job->s_fence->parent : NULL;
>>>>>>
>>>>>>   		atomic_inc(&sched->hw_rq_count);
>>>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx