[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

nicolai.haehnle@xxxxxxx (Nicolai Hähnle) · Mon, 9 Oct 2017 12:14:29 +0200

On 09.10.2017 10:02, Christian König wrote:
>> For the gpu reset patches (already submitted to pub) I would make the
>> kernel return -ENODEV if the fence being waited on (in the cs_wait or
>> wait_fences IOCTL) is found to be in error; that way the UMD would take
>> the robust extension path and conclude that a GPU hang occurred.
> Well that is only closed source behavior which is completely irrelevant 
> for upstream development.
> 
> As far as I know we haven't pushed the change to return -ENODEV upstream.

FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats 
those as context lost. Perhaps we could use the same error on fences? 
That makes more sense to me than -ENODEV.
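
To make the ordering concrete, here is a rough userspace model of the idea, not kernel code. The mock_fence struct and the mock_* helpers are hypothetical stand-ins for struct dma_fence, dma_fence_set_error() and dma_fence_signal(); the -EBUSY "still pending" result is likewise only an assumption for the sketch. The point is just that the error is recorded before the fence is signaled, so a waiter that wakes up can see -ECANCELED.

```c
/* Userspace model only: mock_fence and the mock_* helpers are hypothetical
 * stand-ins for the kernel's dma_fence API. This is not driver code. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct mock_fence {
	bool signaled;
	int error;	/* 0, or a negative errno set before signaling */
};

/* Modeled on dma_fence_set_error(): only valid before the fence signals,
 * and only with a negative error code. */
static void mock_fence_set_error(struct mock_fence *f, int error)
{
	assert(!f->signaled);
	assert(error < 0);
	f->error = error;
}

static void mock_fence_signal(struct mock_fence *f)
{
	f->signaled = true;
}

/* What a waiter could report once it looks at the fence. */
static int mock_fence_wait_status(const struct mock_fence *f)
{
	if (!f->signaled)
		return -EBUSY;	/* assumed "still pending" result */
	return f->error;
}

/* Teardown path for a job that will never run: record the error first,
 * then signal, so waiters wake up and can tell why. */
static void mock_cancel_job_fence(struct mock_fence *f)
{
	mock_fence_set_error(f, -ECANCELED);
	mock_fence_signal(f);
}
```

With something along these lines, a wait on a fence that was cancelled during entity teardown would report -ECANCELED, which is the value radeonsi already treats as context lost on CS submission.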

Cheers,
Nicolai

> 
> Regards,
> Christian.
> 
> On 09.10.2017 at 08:42, Liu, Monk wrote:
>> Christian
>>
>>> It would be really nice to have an error code set on 
>>> s_fence->finished before it is signaled, use dma_fence_set_error() 
>>> for this.
>> For the gpu reset patches (already submitted to pub) I would make the
>> kernel return -ENODEV if the fence being waited on (in the cs_wait or
>> wait_fences IOCTL) is found to be in error; that way the UMD would take
>> the robust extension path and conclude that a GPU hang occurred.
>>
>> I don't know whether this is expected when a normal process is killed
>> or crashes, as Nicolai hit ... since no GPU hang occurs in that case.
>>
>>
>> BR Monk
>>
>>
>>
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf 
>> Of Christian König
>> Sent: 28 September 2017 23:01
>> To: Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
>> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining 
>> fences in amd_sched_entity_fini
>>
>> On 28.09.2017 at 16:55, Nicolai Hähnle wrote:
>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>
>>> Highly concurrent Piglit runs can trigger a race condition where a
>>> pending SDMA job on a buffer object is never executed because the
>>> corresponding process is killed (perhaps due to a crash). Since the
>>> job's fences were never signaled, the buffer object was effectively
>>> leaked. Worse, the buffer was stuck wherever it happened to be at the 
>>> time, possibly in VRAM.
>>>
>>> The symptom was user space processes stuck in interruptible waits with
>>> kernel stacks like:
>>>
>>>       [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>       [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>       [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>       [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>       [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>       [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>       [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>       [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>       [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>       [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>       [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>       [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>       [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>       [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>       [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>       [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>       [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>       [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>> Acked-by: Christian König <christian.koenig at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> index 54eb77cffd9b..32a99e980d78 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>                        amd_sched_entity_is_idle(entity));
>>>        amd_sched_rq_remove_entity(rq, entity);
>>>        if (r) {
>>>            struct amd_sched_job *job;
>>>            /* Park the kernel for a moment to make sure it isn't processing
>>>             * our enity.
>>>             */
>>>            kthread_park(sched->thread);
>>>            kthread_unpark(sched->thread);
>>> -        while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>> +        while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>> +            struct amd_sched_fence *s_fence = job->s_fence;
>>> +            amd_sched_fence_scheduled(s_fence);
>> It would be really nice to have an error code set on s_fence->finished 
>> before it is signaled, use dma_fence_set_error() for this.
>>
>> In addition, it would be nice to note in the subject line that this
>> is a rather important bug fix.
>>
>> With that fixed, the whole series is Reviewed-by: Christian König
>> <christian.koenig at amd.com>.
>>
>> Regards,
>> Christian.
>>
>>> +            amd_sched_fence_finished(s_fence);
>>> +            dma_fence_put(&s_fence->finished);
>>>                sched->ops->free_job(job);
>>> +        }
>>>        }
>>>        kfifo_free(&entity->job_queue);
>>>    }
>>>    static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>>    {
>>>        struct amd_sched_entity *entity =
>>>            container_of(cb, struct amd_sched_entity, cb);
>>>        entity->dependency = NULL;
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
>