Quoting Michal Hocko (2020-06-25 08:57:25)
> On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > A general rule of thumb is that shrinkers should be fast and effective.
> > They are called from direct reclaim at the most inconvenient of times when
> > the caller is waiting for a page. If we attempt to reclaim a page being
> > pinned for active dma [pin_user_pages()], we will incur far greater
> > latency than for a normal anonymous page mapped multiple times. Worse,
> > the page may be in use indefinitely by the HW and unable to be reclaimed
> > in a timely manner.
> >
> > A side effect of the LRU shrinker not being dma aware is that we will
> > often attempt to perform direct reclaim on the persistent group of dma
> > pages while continuing to use the dma HW (an issue as the HW may already
> > be actively waiting for the next user request), and even attempt to
> > reclaim a partially allocated dma object in order to satisfy pinning
> > the next user page for that object.
>
> You are talking about direct reclaim, but this path is shared with
> background reclaim, which is a bit confusing. Maybe you just want to
> outline the reclaim latency, which is more noticeable to userspace in
> direct reclaim. This would be good to clarify.
>
> How much memory are we talking about here btw?

It depends. In theory, it is used sparingly. But it is under userspace
control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If
all goes to plan, the application memory is only pinned for as long as
the HW is using it, but that is an indefinite period of time and an
indefinite amount of memory. There are provisions in place to impose
upper limits on how long an operation can last on the HW, and the
mmu-notifier is there to ensure we do unpin the memory on demand.
However, cancelling a HW operation (which will result in data loss, and
often process termination due to an unfortunate sequence of events when
userspace fails to recover) for a try_to_unmap on behalf of the LRU
shrinker is not a good choice.

> > It is to be expected that such pages are made available for reclaim at
> > the end of the dma operation [unpin_user_pages()], and for truly
> > longterm pins to be proactively recovered via device specific shrinkers
> > [i.e. stop the HW, allow the pages to be returned to the system, and
> > then compete again for the memory].
>
> Is the latter implemented?

It depends on the driver; i915 has had a shrinker since before we
introduced get_user_pages objects. We have the same problem of trying to
mitigate userspace wanting to use all of memory for a single operation,
whether it comes from shmemfs or get_user_pages.

> Btw. the overall intention of the patch is not really clear to me. Do I
> get it right that this is going to reduce the latency of reclaim for
> pages that are not reclaimable anyway because they are pinned? If yes,
> do we have any numbers for that?

I can plug it into a microbenchmark a la cycletest to show the impact...

Memory is filled with 64M gup objects, with random utilisation of those
by the GPU; a background process fills the pagecache with find /; we
report the difference between the expected and the actual expiry of a
timer:

[On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each
measuring mean latency for 20s -- the mean is probably a really bad
choice here; we need 50/90/95/99 percentiles]

Direct reclaim calling mmu-notifier:
  gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us

Skipping try_to_unmap_one with page_maybe_dma_pinned:
  gem_syslatency: cycles=1965, latency mean=597.971us max=28462us

Baseline (background find /; application touched all memory, but no HW ops):
  gem_syslatency: cycles=0, latency mean=6.695us max=77us

Compare with the time to allocate a single THP against load:

Baseline:
  gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
Direct reclaim calling mmu-notifier:
  gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
page_maybe_dma_pinned skip:
  gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us

Take these numbers with a massive pinch of salt. I expect, once I find
the right sequence, to reliably control the induced latency on the RT
thread. But first, I have to look at why there's a correlation between
HW load and timer latency, even with steady-state usage. That's quite
surprising -- ah, I had left it at PREEMPT_VOLUNTARY, and this machine
has to scan every request submitted to HW. Just great.

With PREEMPT:

Timer:
  Base:    gem_syslatency: cycles=0,    latency mean=8.823us    max=83us
  Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us   max=4805us
  Skip:    gem_syslatency: cycles=2677, latency mean=70.306us   max=4720us

THP:
  Base:    gem_syslatency: cycles=0,    latency mean=1993.693us max=201958us
  Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
  Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us

Earlier caveats notwithstanding, confidence in these results is still
low. The testing also needs refining; at the very least we need to
gather enough samples for credible statistics.

> It would also be good to explain why the bail-out is implemented in
> try_to_unmap rather than shrink_page_list.

I'm in the process of working up the chain, having started with trying
to circumvent the wait for reclaim in the mmu-notifier callback in the
driver.
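For concreteness, the bail-out under discussion has roughly this shape
-- a sketch of the idea against the v5.8-era mm/rmap.c, not the literal
patch:

	/*
	 * Sketch only: bail out of try_to_unmap_one() up front if the
	 * page is probably pinned for DMA, so reclaim skips it at once
	 * instead of stalling behind the HW via the mmu-notifier.
	 */
	static bool try_to_unmap_one(struct page *page,
				     struct vm_area_struct *vma,
				     unsigned long address, void *arg)
	{
		/*
		 * page_maybe_dma_pinned() is a heuristic: a page with a
		 * merely elevated refcount may be reported as pinned. A
		 * false positive only costs skipping a reclaimable page
		 * until a later pass.
		 */
		if (page_maybe_dma_pinned(page))
			return false;	/* not unmapped; reclaim moves on */

		/* ... the existing pte walk and unmap proceed unchanged ... */
		return true;
	}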
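The mmu-notifier wait mentioned above is the driver-side half of the
story; a hypothetical, minimal sketch of such a callback (the my_* and
userptr_* names are illustrative, this is not the actual i915 userptr
code):

	struct my_userptr_object {
		struct mmu_notifier notifier;	/* embedded subscription */
		struct page **pages;	/* pinned with pin_user_pages() */
		unsigned long npages;
	};

	/*
	 * Hypothetical driver hook: stop the HW and wait for it to be
	 * idle; fails if waiting is not allowed.
	 */
	static int my_hw_wait_idle(struct my_userptr_object *obj,
				   bool blockable);

	static int
	userptr_invalidate_range_start(struct mmu_notifier *mn,
				       const struct mmu_notifier_range *range)
	{
		struct my_userptr_object *obj =
			container_of(mn, struct my_userptr_object, notifier);
		int err;

		/*
		 * Make the HW stop using the pages first. If the
		 * notifier is not allowed to block, we cannot wait for
		 * the HW and must ask to be retried instead.
		 */
		err = my_hw_wait_idle(obj,
				      mmu_notifier_range_blockable(range));
		if (err)
			return -EAGAIN;

		/* Only now can the pin be dropped and the pages returned. */
		unpin_user_pages(obj->pages, obj->npages);
		obj->npages = 0;

		return 0;
	}

	static const struct mmu_notifier_ops userptr_notifier_ops = {
		.invalidate_range_start = userptr_invalidate_range_start,
	};

It is this wait that direct reclaim can end up blocked on, which is why
bailing out earlier, in try_to_unmap_one(), avoids the latency spike.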
-Chris