For the record, Michel's patch "drm/ttm: Wait for a BO to become idle
before unbinding it from GTT" fixes our KFD problem as well.

Thanks,
  Felix

On 16-07-27 05:27 PM, Felix Kuehling wrote:
> We're also looking into a hang with a KFD unit test that allocates lots
> of memory and fragments it deliberately, without mapping it all at once.
> It's a new problem for us as we're rebasing on amd-staging-4.6.
> Something weird seems to be happening with evictions, but I haven't been
> able to figure it out.
>
> I was able to see that SDMA page table updates stop working at some
> point, though SDMA fences are still signaling. If I let the test run
> longer, SDMA and CP hang. I dumped the SDMA IBs and didn't see anything
> suspicious. My guess was that maybe the SDMA IBs or the ring are getting
> corrupted, or maybe the GART table entries for the IBs or ring are
> corrupted. But I haven't been able to prove that or track it down to a
> root cause. We're now trying to reimplement the test using libdrm-amdgpu
> APIs so we can bisect on the amd-staging-4.6 branch without KFD.
>
> Regards,
>   Felix
>
> On 16-07-26 10:26 PM, Michel Dänzer wrote:
>> On 22.07.2016 22:10, Christian König wrote:
>>> From: Christian König <christian.koenig at amd.com>
>>>
>>> We still need to unbind explicitly during a move.
>> This change fixed a hang for me when running the piglit test
>> max-texture-size with the radeon driver on Kaveri.
>>
>> However, there's still a similar hang left when letting the piglit test
>> tex3d-maxsize run concurrently with other tests (running tex3d-maxsize
>> alone doesn't hang, but fails due to running out of GPU memory; that's a
>> recent radeonsi regression). There are
>>
>> [TTM] Buffer eviction failed
>>
>> messages in dmesg shortly before the hang.
>>
>> I haven't seen such hangs with older kernels. Any ideas offhand what the
>> problem could be? If not, I'll try bisecting.
>>