We had to revert another change on the KFD branch to fix a buffer move
problem: 8b6b79f43801f00ddcdc10a4d5719eba4b2e32aa (drm/amdgpu: group BOs
by log2 of the size on the LRU v2)

We haven't looked into this change in detail yet to understand the
cause. Kent found it by bisecting on amd-staging-4.6 and applying KFD
changes on top.

Regards,
  Felix

On 16-08-05 11:06 AM, Felix Kuehling wrote:
> For the record, Michel's patch "drm/ttm: Wait for a BO to become idle
> before unbinding it from GTT" fixes our KFD problem as well.
>
> Thanks,
>   Felix
>
> On 16-07-27 05:27 PM, Felix Kuehling wrote:
>> We're also looking into a hang with a KFD unit test that allocates lots
>> of memory and fragments it deliberately, without mapping it all at once.
>> It's a new problem for us as we're rebasing on amd-staging-4.6.
>> Something weird seems to be happening with evictions, but I haven't been
>> able to figure it out.
>>
>> I was able to see that SDMA page table updates stop working at some
>> point, though SDMA fences are still signaling. If I let the test run
>> longer, SDMA and CP hang. I dumped the SDMA IBs and didn't see anything
>> suspicious. My guess was that maybe the SDMA IBs or the ring are getting
>> corrupted, or maybe the GART table entries for the IBs or ring are
>> corrupted. But I haven't been able to prove that or track it down to a
>> root cause. We're now trying to reimplement the test using libdrm-amdgpu
>> APIs so we can bisect on the amd-staging-4.6 branch without KFD.
>>
>> Regards,
>>   Felix
>>
>> On 16-07-26 10:26 PM, Michel Dänzer wrote:
>>> On 22.07.2016 22:10, Christian König wrote:
>>>> From: Christian König <christian.koenig at amd.com>
>>>>
>>>> We still need to unbind explicitely during a move.
>>> This change fixed a hang for me when running the piglit test
>>> max-texture-size with the radeon driver on Kaveri.
>>>
>>> However, there's still a similar hang left when letting the piglit test
>>> tex3d-maxsize run concurrently with other tests (running tex3d-maxsize
>>> alone doesn't hang, but fails due to running out of GPU memory; that's a
>>> recent radeonsi regression). There are
>>>
>>> [TTM] Buffer eviction failed
>>>
>>> messages in dmesg shortly before the hang.
>>>
>>> I haven't seen such hangs with older kernels. Any ideas offhand what the
>>> problem could be? If not, I'll try bisecting.