On 15.08.2016 at 23:03, Alex Deucher wrote:
> On Mon, Aug 15, 2016 at 3:06 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> Patch against current amd-staging-4.6 is attached.
> Reviewed-by: Alex Deucher <alexander.deucher at amd.com>

Reviewed-by: Christian König <christian.koenig at amd.com>

>
>> Regards,
>>   Felix
>>
>>
>> On 16-08-13 05:25 AM, Christian König wrote:
>>> On 13.08.2016 at 01:22, Felix Kuehling wrote:
>>>> [CC Kent FYI]
>>>>
>>>> On 16-08-11 04:31 PM, Deucher, Alexander wrote:
>>>>>> -----Original Message-----
>>>>>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
>>>>>> Of Felix Kuehling
>>>>>> Sent: Thursday, August 11, 2016 3:52 PM
>>>>>> To: Michel Dänzer; Christian König
>>>>>> Cc: amd-gfx at lists.freedesktop.org
>>>>>> Subject: Reverted another change to fix buffer move hangs (was Re:
>>>>>> [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)
>>>>>>
>>>>>> We had to revert another change on the KFD branch to fix a buffer move
>>>>>> problem: 8b6b79f43801f00ddcdc10a4d5719eba4b2e32aa (drm/amdgpu: group BOs
>>>>>> by log2 of the size on the LRU v2)
>>>>> That makes sense. I think you may want a different LRU scheme for
>>>>> KFD, or at least special handling for KFD buffers.
>>>> [FK] But I think the patch shouldn't cause hangs regardless.
>>>>
>>>> I eventually found what the problem was. The "group BOs by log2 of the
>>>> size on the LRU v2" patch exposed a latent bug related to the GART size.
>>>> On our KFD branch, we calculate the GART size differently, and it can
>>>> easily go above 4GB. I think on amd-staging-4.6 the GART size can also
>>>> go above 4GB on cards with lots of VRAM.
>>>>
>>>> However, the offset parameter in amdgpu_gart_bind and unbind is only
>>>> 32-bit. With the patch, our test ended up using GART offsets beyond 4GB
>>>> for the first time. Changing the offset parameter to uint64_t fixes the
>>>> problem.
>>> Nice catch, please provide a patch to fix this.
>>>
>>>> Our test also demonstrates a potential flaw in the log2 grouping patch:
>>>> when a buffer of a previously unused size is added to the LRU, it gets
>>>> added to the front of the list rather than the tail. So an application
>>>> that allocates a very large buffer after a bunch of smaller buffers is
>>>> very likely to have that buffer evicted over and over again before any
>>>> smaller buffers are considered for eviction. I believe this can result
>>>> in thrashing of large buffers.
>>>>
>>>> Some other observations: when the last BO of a given size is removed
>>>> from the LRU list, the LRU tail for that size is left "floating" in the
>>>> middle of the LRU list. So the next BO of that size that is added will
>>>> end up at an arbitrary position in the list, possibly even in the
>>>> middle of a block of pages of a different size. So a log2 grouping may
>>>> end up being split.
>>> Yeah, those are more or less known issues.
>>>
>>> Keep in mind that we only added the grouping by log2 of the size to
>>> have a justification to push the TTM changes upstream for the coming
>>> KFD fences.
>>>
>>> E.g. so that we are able to have this upstream before we try to push
>>> on the fence code.
>>>
>>> I will take a look at fixing those issues when I have time; it shouldn't
>>> be too complicated to set the entries to zero when they aren't used, or
>>> to adjust the other entries as well when some are added.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx