On 27/03/17 07:29 PM, Marek Olšák wrote:
> On Mar 27, 2017 9:35 AM, "Michel Dänzer" <michel at daenzer.net> wrote:
>
>     On 25/03/17 01:33 AM, Marek Olšák wrote:
>     > Hi,
>     >
>     > I'm sharing this idea here, because it's something that has been
>     > decreasing our performance a lot recently, for example:
>     > http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>     >
>     > I think the problem there is that Mesa git started uploading
>     > descriptors and uniforms to VRAM, which helps when TC L2 has a low
>     > hit/miss ratio, but the performance can randomly drop by an order of
>     > magnitude. I've heard rumours that kernel 4.11 has an improved
>     > allocator that should perform better, but the situation is still far
>     > from ideal.
>     >
>     > AMD CPUs and APUs will hopefully suffer less, because we can resize
>     > the visible VRAM with the help of our CPU hw specs, but Intel CPUs
>     > will remain limited to 256 MB. The following plan describes how to do
>     > throttling for visible VRAM evictions.
>     >
>     >
>     > 1) Theory
>     >
>     > Initially, the driver doesn't care about where buffers are in VRAM,
>     > because VRAM buffers are only moved to visible VRAM on CPU page faults
>     > (when the CPU touches the buffer memory but the memory is in the
>     > invisible part of VRAM). When it happens,
>     > amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
>     > visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
>     > also marks the buffer as contiguous, which makes memory fragmentation
>     > worse.
>     >
>     > I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
>     > was much higher in a CPU profiler than anything else in the kernel.
>     >
>     >
>     > 2) Monitoring via Gallium HUD
>     >
>     > We need to expose 2 kernel counters via the INFO ioctl and display
>     > those via Gallium HUD:
>     > - The number of VRAM CPU page faults. (the number of calls to
>     > amdgpu_bo_fault_reserve_notify).
>     > - The number of bytes moved by ttm_bo_validate inside
>     > amdgpu_bo_fault_reserve_notify.
>     >
>     > This will help us observe what exactly is happening and fine-tune the
>     > throttling when it's done.
>     >
>     >
>     > 3) Solution
>     >
>     > a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
>     > (amdgpu_bo::had_cpu_page_fault = true)
>     >
>     > b) Monitor the MB/s rate at which buffers are moved by
>     > amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
>     > don't move the buffer to visible VRAM. Move it to GTT instead. Note
>     > that moving to GTT can be cheaper, because moving to visible VRAM is
>     > likely to evict a lot of buffers there and unmap them from the CPU,
>
>     FWIW, this can be avoided by only setting GTT in busy_placement. Then
>     TTM will only move the BO to visible VRAM if that can be done without
>     evicting anything from there.
>
>
>     > but moving to GTT shouldn't evict or unmap anything.
>     >
>     > c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
>     > it can be moved to VRAM if:
>     > - the GTT->VRAM move rate is low enough to allow it (this is the
>     > existing throttling mechanism)
>     > - the visible VRAM move rate is low enough that we will be OK with
>     > another CPU page fault if it happens.
>
>     Some other ideas that might be worth trying:
>
>     Evicting BOs to GTT instead of moving them to CPU accessible VRAM in
>     principle in some cases (e.g. for all BOs except those with
>     AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.
>
>
> I've tried this and it made things even worse.

What exactly did you try?


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
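
For reference, steps a) to c) of the quoted plan, plus the two counters from section 2, could be modelled roughly as below. This is only a standalone user-space sketch, not kernel code: struct fault_throttle, enum placement, on_cpu_page_fault() and may_move_back_to_vram() are hypothetical names, the per-window byte budget stands in for the "MB/s threshold", and nothing here calls into amdgpu or TTM. The busy_placement variant mentioned above is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placement choices; these are not the amdgpu/TTM domain flags. */
enum placement { PLACE_VISIBLE_VRAM, PLACE_GTT };

/* Hypothetical accounting state for buffer moves triggered by CPU page faults. */
struct fault_throttle {
    uint64_t window_start_us; /* start of the current accounting window */
    uint64_t window_len_us;   /* window length in microseconds */
    uint64_t bytes_in_window; /* bytes moved to visible VRAM in this window */
    uint64_t max_bytes;       /* budget per window (models the MB/s threshold) */
    uint64_t num_faults;      /* section 2: number of VRAM CPU page faults */
    uint64_t bytes_moved;     /* section 2: bytes moved by the fault handler */
};

/* Steps a) and b): record the fault, then pick a destination for the buffer. */
static enum placement on_cpu_page_fault(struct fault_throttle *t,
                                        uint64_t now_us, uint64_t bo_size,
                                        bool *had_cpu_page_fault)
{
    assert(t->window_len_us > 0);

    *had_cpu_page_fault = true; /* step a) */
    t->num_faults++;
    t->bytes_moved += bo_size;  /* the buffer is moved either way */

    /* Start a fresh window once the current one has expired. */
    if (now_us - t->window_start_us >= t->window_len_us) {
        t->window_start_us = now_us;
        t->bytes_in_window = 0;
    }

    /* Step b): over budget, so go to GTT instead of visible VRAM. */
    if (bo_size > t->max_bytes - t->bytes_in_window)
        return PLACE_GTT;

    t->bytes_in_window += bo_size;
    return PLACE_VISIBLE_VRAM;
}

/*
 * Step c): at command submission, a buffer that has faulted before may go
 * back to VRAM only if the existing GTT->VRAM throttle allows it and the
 * visible-VRAM budget could still absorb another fault on this buffer.
 */
static bool may_move_back_to_vram(const struct fault_throttle *t,
                                  bool had_cpu_page_fault,
                                  bool gtt_to_vram_rate_ok, uint64_t bo_size)
{
    if (!gtt_to_vram_rate_ok)
        return false;
    if (!had_cpu_page_fault)
        return true;
    return bo_size <= t->max_bytes - t->bytes_in_window;
}
```

With a 64 MiB budget per one-second window, a 48 MiB buffer faulting at t = 0 would go to visible VRAM, a 32 MiB buffer faulting 100 ms later would be sent to GTT, and the same 32 MiB fault after the window has expired would be allowed into visible VRAM again. The window length and budget are made-up numbers that would need tuning against the HUD counters.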