On 11.10.2017 at 18:30, Michel Dänzer wrote:
> On 28/09/17 04:55 PM, Nicolai Hähnle wrote:
>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>
>> Highly concurrent Piglit runs can trigger a race condition where a pending
>> SDMA job on a buffer object is never executed because the corresponding
>> process is killed (perhaps due to a crash). Since the job's fences were
>> never signaled, the buffer object was effectively leaked. Worse, the
>> buffer was stuck wherever it happened to be at the time, possibly in VRAM.
>>
>> The symptom was user space processes stuck in interruptible waits with
>> kernel stacks like:
>>
>>     [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>     [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>     [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>     [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>     [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>     [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>     [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>     [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>     [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>     [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>     [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>     [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>     [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>     [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>     [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>     [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>     [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>     [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>> Acked-by: Christian König <christian.koenig at amd.com>
>
> Since Christian's commit which introduced the problem (6af0883ed977
> "drm/amdgpu: discard commands of killed processes") is in 4.14, we need
> a solution for that.
> Should we backport Nicolai's five commits fixing
> the problem, or revert 6af0883ed977?
>
>
> While looking into this, I noticed that the following commits by
> Christian in 4.14 each also cause hangs for me when running the piglit
> gpu profile on Tonga:
>
> 457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
> 1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"
>
> Are there fixes for these that can be backported to 4.14, or do they
> need to be reverted there?

Well, I'm not aware that either of those two can cause problems.

For "drm/amdgpu: remove the GART copy hack" I also don't have the slightest idea how that could be an issue. It just removes an unused code path.

Is amd-staging-drm-next stable for you?

Thanks,
Christian.