On 28/09/17 04:55 PM, Nicolai Hähnle wrote:
> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>
> Highly concurrent Piglit runs can trigger a race condition where a pending
> SDMA job on a buffer object is never executed because the corresponding
> process is killed (perhaps due to a crash). Since the job's fences were
> never signaled, the buffer object was effectively leaked. Worse, the
> buffer was stuck wherever it happened to be at the time, possibly in VRAM.
>
> The symptom was user space processes stuck in interruptible waits with
> kernel stacks like:
>
>   [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>   [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>   [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>   [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>   [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>   [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>   [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>   [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>   [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>   [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>   [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>   [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>   [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>   [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>   [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>   [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>   [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>   [<ffffffffffffffff>] 0xffffffffffffffff
>
> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
> Acked-by: Christian König <christian.koenig at amd.com>

Since Christian's commit which introduced the problem (6af0883ed977
"drm/amdgpu: discard commands of killed processes") is in 4.14, we need
a solution for that. Should we backport Nicolai's five commits fixing
the problem, or revert 6af0883ed977?

While looking into this, I noticed that the following commits by
Christian in 4.14 each also cause hangs for me when running the piglit
gpu profile on Tonga:

457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"

Are there fixes for these that can be backported to 4.14, or do they
need to be reverted there?

-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer