Plan: BO move throttling for visible VRAM evictions

david1.zhou@xxxxxxx (zhoucm1) · Mon, 27 Mar 2017 17:36:01 +0800



On 2017-03-27 17:29, Christian König wrote:
> On APUs I've already enabled using direct access to the stolen parts 
> of system memory.
Thanks, could you point me to where this is done?

Regards,
David Zhou
>
> So there won't be any eviction any more because of page faults on APUs.
>
> Regards,
> Christian.
>
> On 27.03.2017 at 09:53, Zhou, David(ChunMing) wrote:
>> For the APU special case, can we prevent evictions from happening
>> between VRAM and GTT?
>>
>> Regards,
>> David Zhou
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On 
>> Behalf Of Michel Dänzer
>> Sent: Monday, March 27, 2017 3:36 PM
>> To: Marek Olšák <maraeo at gmail.com>
>> Cc: amd-gfx mailing list <amd-gfx at lists.freedesktop.org>
>> Subject: Re: Plan: BO move throttling for visible VRAM evictions
>>
>> On 25/03/17 01:33 AM, Marek Olšák wrote:
>>> Hi,
>>>
>>> I'm sharing this idea here, because it's something that has been
>>> decreasing our performance a lot recently, for example:
>>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>>
>>> I think the problem there is that Mesa git started uploading
>>> descriptors and uniforms to VRAM, which helps when TC L2 has a low
>>> hit/miss ratio, but the performance can randomly drop by an order of
>>> magnitude. I've heard rumours that kernel 4.11 has an improved
>>> allocator that should perform better, but the situation is still far
>>> from ideal.
>>>
>>> AMD CPUs and APUs will hopefully suffer less, because we can resize
>>> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
>>> will remain limited to 256 MB. The following plan describes how to do
>>> throttling for visible VRAM evictions.
>>>
>>>
>>> 1) Theory
>>>
>>> Initially, the driver doesn't care about where buffers are in VRAM,
>>> because VRAM buffers are only moved to visible VRAM on CPU page faults
>>> (when the CPU touches the buffer memory but the memory is in the
>>> invisible part of VRAM). When it happens,
>>> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
>>> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
>>> also marks the buffer as contiguous, which makes memory fragmentation
>>> worse.
>>>
>>> I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
>>> was much higher in a CPU profiler than anything else in the kernel.
>>>
>>>
>>> 2) Monitoring via Gallium HUD
>>>
>>> We need to expose 2 kernel counters via the INFO ioctl and display
>>> those via Gallium HUD:
>>> - The number of VRAM CPU page faults. (the number of calls to
>>> amdgpu_bo_fault_reserve_notify).
>>> - The number of bytes moved by ttm_bo_validate inside
>>> amdgpu_bo_fault_reserve_notify.
>>>
>>> This will help us observe what exactly is happening and fine-tune the
>>> throttling when it's done.
>>>
>>>
>>> 3) Solution
>>>
>>> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
>>> (amdgpu_bo::had_cpu_page_fault = true)
>>>
>>> b) Monitor the MB/s rate at which buffers are moved by
>>> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
>>> don't move the buffer to visible VRAM. Move it to GTT instead. Note
>>> that moving to GTT can be cheaper, because moving to visible VRAM is
>>> likely to evict a lot of buffers there and unmap them from the CPU,
>> FWIW, this can be avoided by only setting GTT in busy_placement. Then 
>> TTM will only move the BO to visible VRAM if that can be done without 
>> evicting anything from there.
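The two-pass behaviour described here can be modelled roughly as follows. This is a simplified userspace sketch, not the TTM API: placement_model, domain_state, place_bo and fault_placement_gtt_busy are hypothetical names, and "eviction" is reduced to a flag. The assumption is that the normal placement list is tried first without evicting, and the busy placement list is used only when that fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum mem_domain { DOMAIN_VISIBLE_VRAM = 0, DOMAIN_GTT = 1, DOMAIN_COUNT = 2 };

/* Simplified stand-in for a placement / busy_placement pair. */
struct placement_model {
    enum mem_domain placement[DOMAIN_COUNT];      /* tried first, no eviction */
    unsigned num_placement;
    enum mem_domain busy_placement[DOMAIN_COUNT]; /* fallback, may evict */
    unsigned num_busy_placement;
};

struct domain_state { uint64_t free_bytes[DOMAIN_COUNT]; };

struct place_result { bool ok; enum mem_domain domain; bool evicted; };

struct place_result place_bo(const struct placement_model *p,
                             struct domain_state *st, uint64_t size)
{
    struct place_result r = { false, DOMAIN_GTT, false };
    unsigned i;

    assert(p->num_placement <= DOMAIN_COUNT);
    assert(p->num_busy_placement <= DOMAIN_COUNT);

    /* Pass 1: accept a domain only if the BO fits without evicting. */
    for (i = 0; i < p->num_placement; i++) {
        enum mem_domain d = p->placement[i];
        if (st->free_bytes[d] >= size) {
            st->free_bytes[d] -= size;
            r.ok = true;
            r.domain = d;
            return r;
        }
    }

    /* Pass 2: fall back to the first busy placement, evicting if needed. */
    if (p->num_busy_placement > 0) {
        enum mem_domain d = p->busy_placement[0];
        r.ok = true;
        r.domain = d;
        if (st->free_bytes[d] >= size) {
            st->free_bytes[d] -= size;
        } else {
            r.evicted = true;      /* model: other BOs were pushed out */
            st->free_bytes[d] = 0;
        }
    }
    return r;
}

/* Visible VRAM only if it is free; otherwise GTT (the suggestion above). */
struct placement_model fault_placement_gtt_busy(void)
{
    struct placement_model p = {
        { DOMAIN_VISIBLE_VRAM, DOMAIN_GTT }, 1,
        { DOMAIN_GTT, DOMAIN_GTT }, 1
    };
    return p;
}
```

With only GTT in the busy list, a BO that does not fit into visible VRAM lands in GTT and nothing already resident in visible VRAM is disturbed.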
>>
>>
>>> but moving to GTT shouldn't evict or unmap anything.
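One possible shape for the rate check in step b), sketched as a fixed accounting window in userspace C. Everything here is an assumption for illustration: the names (move_throttle, throttle_on_fault), the fixed-window scheme and the byte budget are not taken from the driver, and the proposal does not specify a threshold value.

```c
#include <assert.h>
#include <stdint.h>

enum fault_target { FAULT_TO_VISIBLE_VRAM, FAULT_TO_GTT };

/* Hypothetical throttle state: a byte budget per fixed time window. */
struct move_throttle {
    uint64_t window_ns;            /* length of one accounting window */
    uint64_t max_bytes_per_window; /* budget; the value is a tuning knob */
    uint64_t window_start_ns;      /* start of the current window */
    uint64_t bytes_in_window;      /* bytes moved to visible VRAM so far */
};

/*
 * Decide where a faulting BO should go. Moves to visible VRAM are
 * charged against the budget; once it is exhausted, the BO is sent
 * to GTT instead until the next window starts.
 */
enum fault_target throttle_on_fault(struct move_throttle *t,
                                    uint64_t now_ns, uint64_t bo_size)
{
    assert(t->window_ns > 0);

    /* Start a fresh window once the current one has elapsed. */
    if (now_ns - t->window_start_ns >= t->window_ns) {
        t->window_start_ns = now_ns;
        t->bytes_in_window = 0;
    }

    if (t->bytes_in_window + bo_size > t->max_bytes_per_window)
        return FAULT_TO_GTT;

    t->bytes_in_window += bo_size;
    return FAULT_TO_VISIBLE_VRAM;
}
```

A sliding window or an exponentially decayed average would smooth out bursts at window boundaries; the fixed window is used here only because it is the shortest to write down.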
>>>
>>> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
>>> it can be moved to VRAM if:
>>> - the GTT->VRAM move rate is low enough to allow it (this is the
>>> existing throttling mechanism)
>>> - the visible VRAM move rate is low enough that we will be OK with
>>> another CPU page fault if it happens.
>> Some other ideas that might be worth trying:
>>
>> Evicting BOs to GTT instead of moving them to CPU-accessible VRAM,
>> either in some cases (e.g. for all BOs except those with
>> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.
>>
>> Implementing eviction from CPU visible to CPU invisible VRAM, similar 
>> to how it's done in radeon. Note that there's potential for userspace 
>> triggering an infinite loop in the kernel in cases where BOs are 
>> moved back from invisible to visible VRAM on page faults.
>>
>>
>