Sorry, please take your time.

Best wishes
Emily Deng

>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: Friday, November 8, 2019 5:08 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Am 08.11.19 um 09:52 schrieb Deng, Emily:
>> Ping.....
>
>You need to give me at least enough time to wake up :)
>
>>
>> Best wishes
>> Emily Deng
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>> Deng, Emily
>>> Sent: Friday, November 8, 2019 10:56 AM
>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>>> -----Original Message-----
>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>
>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>> When the job is already signaled, the s_fence is freed. Then it
>>>>> will hit a null pointer in amdgpu_device_gpu_recover.
>>>> NAK, the s_fence is only set to NULL when the job is destroyed. See
>>>> drm_sched_job_cleanup().
>>> I know it is set to NULL in drm_sched_job_cleanup. But in one case,
>>> by the time we enter amdgpu_device_gpu_recover, the job is already in
>>> drm_sched_job_cleanup and about to be freed. Sometimes
>>> amdgpu_device_gpu_recover is faster: at that point the job is not yet
>>> freed, but s_fence is already NULL.
>
>No, that case can't happen.
>See here:
>
>> drm_sched_job_cleanup(s_job);
>>
>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>> dma_fence_put(job->fence);
>> amdgpu_sync_free(&job->sync);
>> amdgpu_sync_free(&job->sched_sync);
>> kfree(job);
>
>The job itself is freed up directly after freeing the reference to the s_fence.
>
>So you are just papering over a much bigger problem here. This patch is a
>clear NAK.
>
>Regards,
>Christian.
>
>>>> When you see a job without an s_fence then that means the problem is
>>>> somewhere else.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index e6ce949..5a8f08e 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>   	 *
>>>>>   	 * job->base holds a reference to parent fence
>>>>>   	 */
>>>>> -	if (job && job->base.s_fence->parent &&
>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>   		job_signaled = true;
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 31809ca..56cc10e 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>
>>>>>   			spin_lock(&rq->lock);
>>>>>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>> -				    entity->fence_context) {
>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>> +				    entity->fence_context)) {
>>>>>   					if (atomic_read(&bad->karma) >
>>>>>   					    bad->sched->hang_limit)
>>>>>   						if (entity->guilty)
>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>   	 * This iteration is thread safe as sched thread is stopped.
>>>>>   	 */
>>>>>   	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>> -		if (s_job->s_fence->parent &&
>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>   					      &s_job->cb)) {
>>>>>   			atomic_dec(&sched->hw_rq_count);
>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>   		 *
>>>>>   		 * Job is still alive so fence refcount at least 1
>>>>>   		 */
>>>>> -		dma_fence_wait(&s_job->s_fence->finished, false);
>>>>> +		if (s_job->s_fence)
>>>>> +			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>
>>>>>   		/*
>>>>>   		 * We must keep bad job alive for later use during
>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>   	 * GPU recovers can't run in parallel.
>>>>>   	 */
>>>>>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>
>>>>>   		atomic_inc(&sched->hw_rq_count);
>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
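For readers following along, the defensive check the patch adds to amdgpu_device_gpu_recover can be sketched in isolation. The structs below are hypothetical, simplified stand-ins, not the real drm_sched_fence or amdgpu_job definitions; the point is only the short-circuit ordering of the NULL tests (which is what Christian argues should never be needed if job lifetime is handled correctly):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures discussed in the
 * thread -- hypothetical, not the actual drm/amdgpu definitions. */
struct s_fence {
	bool has_parent;       /* models s_fence->parent != NULL */
	bool parent_signaled;  /* models dma_fence_is_signaled(parent) */
};

struct job {
	struct s_fence *s_fence; /* cleared by drm_sched_job_cleanup() */
};

/* The pattern from the patch: each test guards the dereference to its
 * right, so a job whose s_fence was already cleared by a concurrent
 * cleanup is treated as "not signaled" instead of crashing. */
static bool job_already_signaled(const struct job *job)
{
	return job && job->s_fence && job->s_fence->has_parent &&
	       job->s_fence->parent_signaled;
}
```

Because `&&` short-circuits left to right, `job->s_fence->has_parent` is only evaluated once `job->s_fence` is known to be non-NULL; that ordering is the entire fix being debated.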