On 24.06.2017 at 23:50, John Brooks wrote:
> On Sat, Jun 24, 2017 at 08:20:22PM +0200, Christian König wrote:
>> On 24.06.2017 at 01:16, John Brooks wrote:
>>> On Fri, Jun 23, 2017 at 05:02:58PM -0400, Felix Kuehling wrote:
>>>> Hi John,
>>>>
>>>> I haven't read your patches. Just a question based on the cover letter.
>>>>
>>>> I understand that visible VRAM is the biggest pain point. But could the
>>>> same reasoning make sense for invisible VRAM? That is, doing all the
>>>> migrations to VRAM in a workqueue?
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>> I don't see why not. In theory, all non-essential buffer moves could be done
>>> this way, and it would be relatively trivial to extend it to that.
>>>
>>> But I wanted to limit the scope of my changes, at least for this series.
>>> Testing takes a long time and I wanted to focus those testing efforts as much
>>> as possible, produce something well-tested (I hope), and get feedback on this
>>> limited application of the concept before expanding its reach.
>> Yeah, sorry to have to say it, but the whole approach is utter nonsense.
>>
>> What happens is that the delayed BO can only be moved AFTER the command
>> submission which wants it to be in VRAM.
>>
>> So you use the BO in a CS and *then* move it to where the CS wants it to be,
>> no matter whether the BO is still needed there or not.
>>
>> Regards,
>> Christian.
>>
> I'm aware of the effect it has. The BO won't be in VRAM for the current command
> submission, but it'll be there for a future one. If a BO is used at a given
> time, then it's likely it'll be used again soon.

Exactly, that's the problem here. Keep in mind that BOs can only move between
command submissions. So instead of moving the BO on the first command submission
which needs it, you most likely move it directly after that submission.

It is actually quite unlikely that a BO which was swapped out will be used for
multiple command submissions in a row (after all, it was swapped out because it
had been unused for a while).

> In which case you'll come out
> ahead on latency even if the GPU has to read it from GTT a few times. In any
> case, it's never going to hurt as much as full-stop waiting for a worst-case BO
> move that needs a lot of evictions.
>
> Feel free to correct my understanding; you'd certainly know any of this better
> than I do. But my tests indicate that immediate forced moves during CS cause
> stalls, and the average framerate with delayed moves is almost the same (within
> ~2%) as with immediate ones, which is about 9% higher than with no forced moves
> during CS at all.

Your understanding is incorrect. The stalls are not caused by the BO moves
themselves, but rather by the ping-pong they result in and the intermediate
waits this causes.

Memory bandwidth is usually not the problem, but swapping a BO out needs to be
reflected in all VMs that are using the BO. This adds extra dependencies between
command submissions, so instead of rendering and copying at the same time you
get everything serialized.

I'm pretty sure that this is actually the root cause of the issues you run into
here.

Besides limiting those ping-pongs to something reasonable per second, another
possible solution would be to make more room for CPU-accessible VRAM BOs. I
will take a look into this next week; Felix has been kicking me about this for
a while anyway.
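
Just to illustrate the first option, here is a rough sketch of what a per-second
budget for migrations into visible VRAM could look like. The structure and names
(vis_move_limiter, vis_move_allowed, max_bytes_per_sec) are made up purely for
illustration and are not actual amdgpu code:

/*
 * Illustrative only: a per-second byte budget for migrations into
 * CPU-visible VRAM. Field and function names are hypothetical.
 */
#include <linux/jiffies.h>
#include <linux/types.h>

struct vis_move_limiter {
	u64		bytes_this_period;	/* bytes migrated in the current window */
	unsigned long	period_start;		/* jiffies at the start of the window */
	u64		max_bytes_per_sec;	/* budget, e.g. a fraction of the BAR size */
};

/* Return true if another @size byte move into visible VRAM fits the budget. */
static bool vis_move_allowed(struct vis_move_limiter *lim, u64 size)
{
	if (time_after(jiffies, lim->period_start + HZ)) {
		/* A new one-second window has started: reset the accounting. */
		lim->period_start = jiffies;
		lim->bytes_this_period = 0;
	}

	if (lim->bytes_this_period + size > lim->max_bytes_per_sec)
		return false;	/* over budget: leave the BO where it is for now */

	lim->bytes_this_period += size;
	return true;
}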
Regards,
Christian.

>
> DiRT Rally average framerates:
> With the whole patch set (n=3):
>     89.56
> Without it (drm-next-4.13 5ac55629d6b3fcde69f46aa772c6e83be0bdcbbf) (n=3):
>     91.16 (+stalls)
> Patches 1 and 3 only, and with GTT set as the only busy placement for
> CPU_ACCESS_REQUIRED BOs in amdgpu_cs_bo_validate (n=3):
>     82.15
>
> John
>
>>> John
>>>
>>>> On 17-06-23 01:39 PM, John Brooks wrote:
>>>>> This patch series is intended to improve performance when limited CPU-visible
>>>>> VRAM is under pressure.
>>>>>
>>>>> Moving BOs into visible VRAM is essentially a housekeeping task. It's faster
>>>>> to access them in VRAM than in GTT, but it isn't a hard requirement for them
>>>>> to be in VRAM. As such, it is unnecessary to spend valuable time blocking on
>>>>> this in the page fault handler or during command submission. Doing so
>>>>> translates directly into a longer frame time (ergo stalls and stuttering).
>>>>>
>>>>> The problem worsens when attempting to move BOs into visible VRAM while it is
>>>>> full. This takes much longer than a simple move because other BOs have to be
>>>>> evicted, which involves finding and then moving potentially hundreds of other
>>>>> BOs, which is very time consuming. With limited visible VRAM, it's important
>>>>> to do this at some point to keep the contents of visible VRAM fresh, but it
>>>>> does not need to be a blocking operation. If visible VRAM is full, the BO can
>>>>> be read from GTT in the meantime and moved to VRAM later.
>>>>>
>>>>> Thus, I have made it so that neither the command submission code nor the page
>>>>> fault handler spends time evicting BOs from visible VRAM; instead, this is
>>>>> deferred to a workqueue function that is queued when CS requests BOs flagged
>>>>> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
>>>>>
>>>>> Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so
>>>>> that the kernel driver can clear it later even if it was set by userspace.
>>>>> This is because the userspace graphics library can't know whether the
>>>>> application really needs the BO to be CPU_ACCESS_REQUIRED forever. The kernel
>>>>> driver can't know that either, but it does know when page faults occur, and
>>>>> if a BO doesn't appear to have any page faults when it's moved somewhere
>>>>> CPU-inaccessible, the flag can be removed and the BO no longer has to take up
>>>>> space in CPU-visible memory. This change was based on IRC discussions with
>>>>> Michel.
>>>>>
>>>>> Patch 7 fixes a problem with BO move-rate throttling that causes visible VRAM
>>>>> moves to not be throttled if total VRAM isn't full enough.
>>>>>
>>>>> I've also added a vis_vramlimit module parameter for debugging purposes. It's
>>>>> similar to the vramlimit parameter, except that it limits only visible VRAM.
>>>>>
>>>>> I have tested this patch set with the two games I know to be affected by
>>>>> visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
>>>>> eviction-related stuttering in DiRT Rally, as well as the very low
>>>>> performance seen when visible VRAM is limited to 64MB. It also fixes severely
>>>>> low framerates that occurred in some areas of Dying Light. All my testing was
>>>>> done with an R9 290 with 4GB of VRAM and an Intel i7 4790.
>>>>>
>>>>> --
>>>>> John Brooks (Frogging101)
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
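
For reference, a rough sketch of the deferral mechanism the cover letter
describes: a BO that could not be placed in visible VRAM is put on a list, and a
workqueue later drains that list outside of the CS and page-fault paths. The
types and helper names below (deferred_bo, queue_visible_vram_move,
visible_vram_work_func) are hypothetical stand-ins and do not reproduce the
actual patches:

/*
 * Illustrative sketch only: deferring a buffer move to a workqueue instead of
 * blocking command submission on it. Names and types are hypothetical.
 */
#include <linux/workqueue.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct deferred_bo {
	struct list_head	node;
	void			*bo;	/* stand-in for the driver's BO type */
};

static LIST_HEAD(deferred_list);
static DEFINE_SPINLOCK(deferred_lock);

static void visible_vram_work_func(struct work_struct *work)
{
	struct deferred_bo *entry, *tmp;
	LIST_HEAD(local);

	spin_lock(&deferred_lock);
	list_splice_init(&deferred_list, &local);
	spin_unlock(&deferred_lock);

	list_for_each_entry_safe(entry, tmp, &local, node) {
		/*
		 * Here the real driver would reserve the BO and validate it
		 * into visible VRAM, evicting other BOs if necessary, all
		 * outside of the CS and page-fault paths.
		 */
		list_del(&entry->node);
		kfree(entry);
	}
}

static DECLARE_WORK(visible_vram_work, visible_vram_work_func);

/* Called from CS when a CPU_ACCESS_REQUIRED BO could not be placed in VRAM. */
static void queue_visible_vram_move(void *bo)
{
	struct deferred_bo *entry = kzalloc(sizeof(*entry), GFP_KERNEL);

	if (!entry)
		return;		/* best effort: the BO simply stays in GTT */

	entry->bo = bo;
	spin_lock(&deferred_lock);
	list_add_tail(&entry->node, &deferred_list);
	spin_unlock(&deferred_lock);

	schedule_work(&visible_vram_work);
}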