[CC Kent FYI]

On 16-08-11 04:31 PM, Deucher, Alexander wrote:
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
>> Of Felix Kuehling
>> Sent: Thursday, August 11, 2016 3:52 PM
>> To: Michel Dänzer; Christian König
>> Cc: amd-gfx at lists.freedesktop.org
>> Subject: Reverted another change to fix buffer move hangs (was Re:
>> [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)
>>
>> We had to revert another change on the KFD branch to fix a buffer move
>> problem: 8b6b79f43801f00ddcdc10a4d5719eba4b2e32aa (drm/amdgpu:
>> group BOs by log2 of the size on the LRU v2
> That makes sense. I think you may want a different LRU scheme for KFD
> or at least special handling for KFD buffers.

[FK] But I think the patch shouldn't cause hangs, regardless.

I eventually found what the problem was. The "group BOs by log2 of the
size on the LRU v2" patch exposed a latent bug related to the GART size.
On our KFD branch, we calculate the GART size differently, and it can
easily go above 4GB. I think on amd-staging-4.6 the GART size can also
go above 4GB on cards with lots of VRAM. However, the offset parameter
in amdgpu_gart_bind and unbind is only 32-bit. With the patch, our test
ended up using GART offsets beyond 4GB for the first time. Changing the
offset parameter to uint64_t fixes the problem.

Our test also demonstrates a potential flaw in the log2 grouping patch:
When a buffer of a previously unused size is added to the LRU, it gets
added to the front of the list, rather than the tail. So an application
that allocates a very large buffer after a bunch of smaller buffers is
very likely to have that buffer evicted over and over again before any
smaller buffers are considered for eviction. I believe this can result
in thrashing of large buffers.

Some other observations: When the last BO of a given size is removed
from the LRU list, the LRU tail for that size is left "floating" in the
middle of the LRU list. So the next BO of that size that is added will
be added at an arbitrary position in the list. It may even end up in
the middle of a block of pages of a different size. So a log2 grouping
may end up being split.

Regards,
  Felix
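
P.S. To illustrate the 32-bit offset problem described above, here is a
minimal sketch in plain C. The function names are hypothetical stand-ins
and not the real amdgpu_gart_bind signature; the 4096-byte page size is
also just an assumption for the example. It only shows how an offset
above 4GB ends up selecting the wrong GART page once it passes through a
32-bit parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the
 * real amdgpu_gart_bind()/unbind() signatures. */

/* Narrow variant: a 32-bit offset parameter cannot hold bits 32 and up. */
static uint64_t gart_page_index_32(uint32_t offset, uint32_t page_size)
{
	return offset / page_size;
}

/* Wide variant: the full 64-bit offset survives the call. */
static uint64_t gart_page_index_64(uint64_t offset, uint32_t page_size)
{
	return offset / page_size;
}

/* Returns 1 if pushing `offset` through the 32-bit variant selects a
 * different page than the 64-bit variant, 0 otherwise. The explicit
 * cast mirrors the implicit narrowing that happens at a call site. */
static int offset_is_truncated(uint64_t offset, uint32_t page_size)
{
	assert(page_size != 0);
	return gart_page_index_32((uint32_t)offset, page_size) !=
	       gart_page_index_64(offset, page_size);
}
```

With an offset of 4GB plus one page (0x100001000), the narrow variant
sees only 0x1000 and picks page 1, while the wide variant picks page
0x100001. Offsets below 4GB give the same result in both, which is
consistent with the bug staying latent until offsets beyond 4GB were
used.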