On 2017å¹´09æ??28æ?¥ 22:55, Nicolai Hähnle wrote: > From: Nicolai Hähnle <nicolai.haehnle at amd.com> > > Highly concurrent Piglit runs can trigger a race condition where a pending > SDMA job on a buffer object is never executed because the corresponding > process is killed (perhaps due to a crash). Since the job's fences were > never signaled, the buffer object was effectively leaked. Worse, the > buffer was stuck wherever it happened to be at the time, possibly in VRAM. Indeed good catch. Cheers, David Zhou > > The symptom was user space processes stuck in interruptible waits with > kernel stacks like: > > [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 > [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 > [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 > [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] > [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] > [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] > [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] > [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] > [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] > [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] > [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] > [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] > [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] > [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] > [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 > [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 > [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad > [<ffffffffffffffff>] 0xffffffffffffffff > > Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com> > Acked-by: Christian König <christian.koenig at amd.com> > --- > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > index 54eb77cffd9b..32a99e980d78 100644 > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, > amd_sched_entity_is_idle(entity)); > amd_sched_rq_remove_entity(rq, entity); > if (r) { > struct amd_sched_job *job; > > /* Park the kernel for a moment to make sure it isn't processing > * our enity. > */ > kthread_park(sched->thread); > kthread_unpark(sched->thread); > - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) > + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { > + struct amd_sched_fence *s_fence = job->s_fence; > + amd_sched_fence_scheduled(s_fence); > + amd_sched_fence_finished(s_fence); > + dma_fence_put(&s_fence->finished); > sched->ops->free_job(job); > + } > > } > kfifo_free(&entity->job_queue); > } > > static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb) > { > struct amd_sched_entity *entity = > container_of(cb, struct amd_sched_entity, cb); > entity->dependency = NULL;