On Fri, Mar 24, 2017 at 5:45 PM, Christian König <deathsimple at vodafone.de> wrote:
> On 24.03.2017 at 17:33, Marek Olšák wrote:
>>
>> Hi,
>>
>> I'm sharing this idea here, because it's something that has been
>> decreasing our performance a lot recently, for example:
>>
>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>
>> I think the problem there is that Mesa git started uploading
>> descriptors and uniforms to VRAM, which helps when TC L2 has a low
>> hit/miss ratio, but the performance can randomly drop by an order of
>> magnitude. I've heard rumours that kernel 4.11 has an improved
>> allocator that should perform better, but the situation is still far
>> from ideal.
>>
>> AMD CPUs and APUs will hopefully suffer less, because we can resize
>> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
>> will remain limited to 256 MB. The following plan describes how to do
>> throttling for visible VRAM evictions.
>>
>>
>> 1) Theory
>>
>> Initially, the driver doesn't care about where buffers are in VRAM,
>> because VRAM buffers are only moved to visible VRAM on CPU page faults
>> (when the CPU touches the buffer memory but the memory is in the
>> invisible part of VRAM). When that happens,
>> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
>> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
>> also marks the buffer as contiguous, which makes memory fragmentation
>> worse.
>>
>> I verified this with DiRT Rally, where amdgpu_bo_fault_reserve_notify
>> was much higher in a CPU profiler than anything else in the kernel.
>
>
> Good to know that my expectations on this are correct.
>
> How about fixing the need for contiguous buffers when CPU mapping them?
>
> That should actually be pretty easy to do.
>
>> 2) Monitoring via Gallium HUD
>>
>> We need to expose 2 kernel counters via the INFO ioctl and display
>> those via Gallium HUD:
>> - The number of VRAM CPU page faults (i.e. the number of calls to
>> amdgpu_bo_fault_reserve_notify).
>> - The number of bytes moved by ttm_bo_validate inside
>> amdgpu_bo_fault_reserve_notify.
>>
>> This will help us observe what exactly is happening and fine-tune the
>> throttling when it's done.
>>
>>
>> 3) Solution
>>
>> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
>> (amdgpu_bo::had_cpu_page_fault = true)
>
>
> What is that good for?
>
>> b) Monitor the MB/s rate at which buffers are moved by
>> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
>> don't move the buffer to visible VRAM. Move it to GTT instead. Note
>> that moving to GTT can be cheaper, because moving to visible VRAM is
>> likely to evict a lot of buffers there and unmap them from the CPU,
>> but moving to GTT shouldn't evict or unmap anything.
>
>
> Yeah, had that idea as well. I've been working on adding a context to TTM's
> BO validation call chain.
>
> This way we could add a byte limit on how much TTM will try to evict before
> returning -ENOMEM (or better -ENOSPC).
>
>> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
>> it can be moved to VRAM if:
>> - the GTT->VRAM move rate is low enough to allow it (this is the
>> existing throttling mechanism)
>> - the visible VRAM move rate is low enough that we will be OK with
>> another CPU page fault if it happens.
>
>
> Interesting idea, need to think a bit about it.
>
> But I would say this has second priority; fixing the contiguous buffer
> requirement should be first.
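To make 2) and 3) a bit more concrete, below are a few rough sketches. None of this is actual kernel code; the query IDs, counter fields, rate helpers and thresholds are made up for illustration.

For the two counters in 2), the driver could count in the fault handler and report the values through the INFO ioctl, mirroring how the existing queries are implemented:

/* Sketch only -- hypothetical query IDs and field names. */
#define AMDGPU_INFO_NUM_VRAM_CPU_PAGE_FAULTS   0x1e
#define AMDGPU_INFO_VRAM_CPU_PAGE_FAULT_BYTES  0x1f

/* in struct amdgpu_device: */
	atomic64_t num_vram_cpu_page_faults;
	atomic64_t vram_cpu_page_fault_bytes;

/* in amdgpu_bo_fault_reserve_notify(), after a successful
 * ttm_bo_validate() into visible VRAM: */
	atomic64_inc(&adev->num_vram_cpu_page_faults);
	atomic64_add(amdgpu_bo_size(abo), &adev->vram_cpu_page_fault_bytes);

/* in amdgpu_info_ioctl(): */
	case AMDGPU_INFO_NUM_VRAM_CPU_PAGE_FAULTS:
		ui64 = atomic64_read(&adev->num_vram_cpu_page_faults);
		return copy_to_user(out, &ui64, min(size, 8u)) ? -EFAULT : 0;
	case AMDGPU_INFO_VRAM_CPU_PAGE_FAULT_BYTES:
		ui64 = atomic64_read(&adev->vram_cpu_page_fault_bytes);
		return copy_to_user(out, &ui64, min(size, 8u)) ? -EFAULT : 0;

Gallium HUD would then read the two values through the winsys and plot the per-second delta, like it already does for num-bytes-moved.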
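For 3a/3b, the fault handler itself could throttle roughly like this. Note that had_cpu_page_fault, visible_vram_move_rate_mbps() and the threshold don't exist, and restricting the VRAM placement to the CPU-visible range is omitted here:

/* Sketch only -- not the real fault handler. */
int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
{
	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
	struct amdgpu_bo *abo = container_of(bo, struct amdgpu_bo, tbo);

	/* 3a) remember that the CPU faulted on this BO, so the CS ioctl
	 * can later decide whether to promote it back to VRAM. */
	abo->had_cpu_page_fault = true;

	if (visible_vram_move_rate_mbps(adev) > visible_vram_move_threshold_mbps) {
		/* 3b) we're already moving too much into visible VRAM;
		 * validating into GTT evicts and unmaps nothing. */
		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
	} else {
		/* current behaviour: move into the CPU-visible part of VRAM */
		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM);
	}

	return ttm_bo_validate(bo, &abo->placement, false, false);
}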
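For 3c, the CS side could be a small helper in the validation path (again just a sketch; the two rate helpers are hypothetical):

/* Sketch of 3c -- decide where a previously faulted BO should go. */
static u32 amdgpu_cs_choose_domain(struct amdgpu_device *adev,
				   struct amdgpu_bo *bo)
{
	if (!bo->had_cpu_page_fault)
		return bo->prefered_domains;

	/* Promote the BO back to VRAM only if both the existing
	 * GTT->VRAM throttling and the new visible-VRAM fault
	 * throttling say we can afford it. */
	if (gtt_to_vram_move_budget_ok(adev) &&
	    visible_vram_move_rate_mbps(adev) < visible_vram_move_threshold_mbps)
		return AMDGPU_GEM_DOMAIN_VRAM;

	return AMDGPU_GEM_DOMAIN_GTT;
}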
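The byte limit for TTM's validation call chain could be a small context structure passed down to the eviction code (nothing like this exists in TTM today; the struct and field names are made up):

/* Sketch of a validation context with an eviction budget. */
struct ttm_validate_ctx_example {
	bool interruptible;
	bool no_wait_gpu;
	u64 bytes_moved;       /* accumulated by TTM while evicting */
	u64 bytes_moved_limit; /* caller's budget; exceeding it would
				* make ttm_bo_validate() return -ENOSPC */
};

amdgpu could then set a small budget when it validates a faulting BO into visible VRAM and fall back to GTT when -ENOSPC comes back.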
As for the contiguous buffers: interesting, I didn't know the contiguous setting wasn't required. Going to work on fixing that next.

Marek