On Fri, Mar 24, 2017 at 5:45 PM, Christian König <deathsimple at vodafone.de> wrote:
> On 24.03.2017 at 17:33, Marek Olšák wrote:
>>
>> Hi,
>>
>> I'm sharing this idea here, because it's something that has been
>> decreasing our performance a lot recently, for example:
>>
>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>
>> I think the problem there is that Mesa git started uploading
>> descriptors and uniforms to VRAM, which helps when TC L2 has a low
>> hit/miss ratio, but the performance can randomly drop by an order of
>> magnitude. I've heard rumours that kernel 4.11 has an improved
>> allocator that should perform better, but the situation is still far
>> from ideal.
>>
>> AMD CPUs and APUs will hopefully suffer less, because we can resize
>> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
>> will remain limited to 256 MB. The following plan describes how to do
>> throttling for visible VRAM evictions.
>>
>>
>> 1) Theory
>>
>> Initially, the driver doesn't care about where buffers are in VRAM,
>> because VRAM buffers are only moved to visible VRAM on CPU page faults
>> (when the CPU touches the buffer memory but the memory is in the
>> invisible part of VRAM). When that happens,
>> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
>> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
>> also marks the buffer as contiguous, which makes memory fragmentation
>> worse.
>>
>> I verified this with DiRT Rally, where amdgpu_bo_fault_reserve_notify
>> was much higher in a CPU profiler than anything else in the kernel.
>
>
> Good to know that my expectations on this are correct.
>
> How about fixing the need for contiguous buffers when CPU mapping them?
>
> That should actually be pretty easy to do.
>
>> 2) Monitoring via Gallium HUD
>>
>> We need to expose 2 kernel counters via the INFO ioctl and display
>> those via Gallium HUD:
>> - The number of VRAM CPU page faults (i.e. the number of calls to
>> amdgpu_bo_fault_reserve_notify).
>> - The number of bytes moved by ttm_bo_validate inside
>> amdgpu_bo_fault_reserve_notify.
>>
>> This will help us observe what exactly is happening and fine-tune the
>> throttling when it's done.
>>
>>
>> 3) Solution
>>
>> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
>> (amdgpu_bo::had_cpu_page_fault = true)
>
>
> What is that good for?
>
>> b) Monitor the MB/s rate at which buffers are moved by
>> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
>> don't move the buffer to visible VRAM. Move it to GTT instead. Note
>> that moving to GTT can be cheaper, because moving to visible VRAM is
>> likely to evict a lot of buffers there and unmap them from the CPU,
>> but moving to GTT shouldn't evict or unmap anything.
>
>
> Yeah, had that idea as well. I've been working on adding a context to TTM's
> BO validation call chain.
>
> This way we could add a byte limit on how much TTM will try to evict before
> returning -ENOMEM (or better -ENOSPC).
>
>> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
>> it can be moved to VRAM if:
>> - the GTT->VRAM move rate is low enough to allow it (this is the
>> existing throttling mechanism)
>> - the visible VRAM move rate is low enough that we will be OK with
>> another CPU page fault if it happens.
>
>
> Interesting idea, need to think a bit about it.
>
> But I would say this has second priority; fixing the contiguous buffer
> requirement should be first.
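To make 2) and 3) a bit more concrete, below are a few rough sketches. None of this is actual kernel code; the query IDs, counter fields, rate helpers and thresholds are made up for illustration.

For the two counters in 2), the driver could count in the fault handler and report the values through the INFO ioctl, mirroring how the existing queries are implemented:

/* Sketch only -- hypothetical query IDs and field names. */
#define AMDGPU_INFO_NUM_VRAM_CPU_PAGE_FAULTS   0x1e
#define AMDGPU_INFO_VRAM_CPU_PAGE_FAULT_BYTES  0x1f

/* in struct amdgpu_device: */
	atomic64_t num_vram_cpu_page_faults;
	atomic64_t vram_cpu_page_fault_bytes;

/* in amdgpu_bo_fault_reserve_notify(), after a successful
 * ttm_bo_validate() into visible VRAM: */
	atomic64_inc(&adev->num_vram_cpu_page_faults);
	atomic64_add(amdgpu_bo_size(abo), &adev->vram_cpu_page_fault_bytes);

/* in amdgpu_info_ioctl(): */
	case AMDGPU_INFO_NUM_VRAM_CPU_PAGE_FAULTS:
		ui64 = atomic64_read(&adev->num_vram_cpu_page_faults);
		return copy_to_user(out, &ui64, min(size, 8u)) ? -EFAULT : 0;
	case AMDGPU_INFO_VRAM_CPU_PAGE_FAULT_BYTES:
		ui64 = atomic64_read(&adev->vram_cpu_page_fault_bytes);
		return copy_to_user(out, &ui64, min(size, 8u)) ? -EFAULT : 0;

Gallium HUD would then read the two values through the winsys and plot the per-second delta, like it already does for num-bytes-moved.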
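For 3a/3b, the fault handler itself could throttle roughly like this. Note that had_cpu_page_fault, visible_vram_move_rate_mbps() and the threshold don't exist, and restricting the VRAM placement to the CPU-visible range is omitted here:

/* Sketch only -- not the real fault handler. */
int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
{
	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
	struct amdgpu_bo *abo = container_of(bo, struct amdgpu_bo, tbo);

	/* 3a) remember that the CPU faulted on this BO, so the CS ioctl
	 * can later decide whether to promote it back to VRAM. */
	abo->had_cpu_page_fault = true;

	if (visible_vram_move_rate_mbps(adev) > visible_vram_move_threshold_mbps) {
		/* 3b) we're already moving too much into visible VRAM;
		 * validating into GTT evicts and unmaps nothing. */
		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
	} else {
		/* current behaviour: move into the CPU-visible part of VRAM */
		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM);
	}

	return ttm_bo_validate(bo, &abo->placement, false, false);
}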
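For 3c, the CS side could be a small helper in the validation path (again just a sketch; the two rate helpers are hypothetical):

/* Sketch of 3c -- decide where a previously faulted BO should go. */
static u32 amdgpu_cs_choose_domain(struct amdgpu_device *adev,
				   struct amdgpu_bo *bo)
{
	if (!bo->had_cpu_page_fault)
		return bo->prefered_domains;

	/* Promote the BO back to VRAM only if both the existing
	 * GTT->VRAM throttling and the new visible-VRAM fault
	 * throttling say we can afford it. */
	if (gtt_to_vram_move_budget_ok(adev) &&
	    visible_vram_move_rate_mbps(adev) < visible_vram_move_threshold_mbps)
		return AMDGPU_GEM_DOMAIN_VRAM;

	return AMDGPU_GEM_DOMAIN_GTT;
}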
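The byte limit for TTM's validation call chain could be a small context structure passed down to the eviction code (nothing like this exists in TTM today; the struct and field names are made up):

/* Sketch of a validation context with an eviction budget. */
struct ttm_validate_ctx_example {
	bool interruptible;
	bool no_wait_gpu;
	u64 bytes_moved;       /* accumulated by TTM while evicting */
	u64 bytes_moved_limit; /* caller's budget; exceeding it would
				* make ttm_bo_validate() return -ENOSPC */
};

amdgpu could then set a small budget when it validates a faulting BO into visible VRAM and fall back to GTT when -ENOSPC comes back.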
As for the contiguous buffers: interesting, I didn't know the contiguous setting wasn't required. Going to work on fixing that next.

Marek