The most extreme ping-ponging is mitigated by throttling buffer moves
in the kernel, but that only works without VM_ALWAYS_VALID, and you can
already set BO priorities in the BO list. A better approach that also
works with VM_ALWAYS_VALID would be nice.

Marek

On Wed, Apr 24, 2024 at 1:12 PM Friedrich Vock <friedrich.vock@xxxxxx> wrote:
>
> Hi everyone,
>
> recently I've been looking into remedies for apps (in particular, newer
> games) that experience significant performance loss when they start to
> hit VRAM limits, especially on older or lower-end cards that struggle
> to fit both desktop apps and all the game data into VRAM at once.
>
> The root of the problem lies in the fact that, from userspace's POV,
> buffer eviction is very opaque: Userspace applications/drivers cannot
> tell how oversubscribed VRAM is, nor do they have fine-grained control
> over which buffers get evicted. At the same time, with GPU APIs becoming
> increasingly lower-level and GPU-driven, only the application itself
> can know which buffers are used within a particular submission, and
> how important each buffer is. For this, GPU APIs include interfaces
> to query oversubscription and specify memory priorities: In Vulkan,
> oversubscription can be queried through the VK_EXT_memory_budget
> extension, and buffers can be assigned priorities via the
> VK_EXT_pageable_device_local_memory extension. Modern games, especially
> D3D12 games running via vkd3d-proton, rely on oversubscription being
> reported and priorities being respected in order to perform their own
> memory management.
>
> However, relaying this information to the kernel via the current KMD
> uAPIs is not possible. On AMDGPU, for example, every work submission
> includes a "bo list" that contains every buffer object accessed during
> the course of the submission. If VRAM is oversubscribed and a buffer in
> the list was evicted to system memory, that buffer is moved back to
> VRAM (potentially evicting other unused buffers).
>
> Since the usermode driver doesn't know which buffers are actually used
> by the application, its only choice is to submit a bo list that
> contains every buffer the application has allocated. In case of VRAM
> oversubscription, it is highly likely that some of the application's
> buffers were evicted, which almost guarantees that some buffers will
> get moved around. Since the bo list is only known at submit time, this
> also means the buffers will get moved right before submitting
> application work, which is the worst possible time to move buffers from
> a latency perspective. Another consequence of the large bo list is that
> nearly all memory from other applications will be evicted, too. When
> different applications (e.g. game and compositor) submit work one after
> the other, this causes a ping-pong effect where each app's submission
> evicts the other app's memory, resulting in a large number of
> unnecessary moves.
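>
> To make this concrete, here is a rough sketch of what the UMD's
> submission path boils down to today. struct drm_amdgpu_bo_list_entry is
> the actual amdgpu uAPI struct; the umd_* types are made-up stand-ins
> for the driver's internal bookkeeping:
>
>     #include <stdint.h>
>     #include <stdlib.h>
>     #include <drm/amdgpu_drm.h>
>
>     struct umd_bo {
>         uint32_t kms_handle; /* GEM handle of the allocation */
>         uint32_t priority;   /* per-BO priority hint */
>     };
>
>     struct umd_device {
>         struct umd_bo **bos; /* every BO the app ever allocated */
>         uint32_t num_bos;
>     };
>
>     /* Build the bo list for a submission. Because the UMD can't know
>      * which buffers a GPU-driven app will actually use, every
>      * allocation goes in. */
>     static struct drm_amdgpu_bo_list_entry *
>     build_full_bo_list(const struct umd_device *dev, uint32_t *count)
>     {
>         struct drm_amdgpu_bo_list_entry *entries =
>             calloc(dev->num_bos, sizeof(*entries));
>
>         if (!entries)
>             return NULL;
>
>         for (uint32_t i = 0; i < dev->num_bos; i++) {
>             entries[i].bo_handle = dev->bos[i]->kms_handle;
>             entries[i].bo_priority = dev->bos[i]->priority;
>         }
>
>         *count = dev->num_bos;
>         return entries;
>     }
>
> Every entry in that list that was evicted gets migrated back to VRAM at
> submit time, whether or not the submission actually touches it.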
>
> This overly aggressive eviction behavior led to RADV adopting a change
> that effectively allows all VRAM allocations to reside in system
> memory [1]. This worked around the ping-ponging/excessive buffer moving
> problem, but also meant that any memory evicted to system memory would
> stay there forever, regardless of how VRAM is actually used.
>
> My proposal aims at providing a middle ground between these extremes.
> The goals I want to meet are:
> - Userspace is accurately informed about VRAM oversubscription/how much
>   VRAM has been evicted
> - Buffer eviction respects priorities set by userspace
> - Wasteful ping-ponging is avoided to the extent possible
>
> I have been testing out some prototypes, and came up with this rough
> sketch of an API:
>
> - For each ttm_resource_manager, the amount of evicted memory is
>   tracked (similarly to how "usage" tracks the memory usage). When
>   memory is evicted via ttm_bo_evict, the size of the evicted memory is
>   added; when memory is un-evicted (see below), its size is subtracted.
>   The amount of evicted memory for e.g. VRAM can be queried by
>   userspace via an ioctl.
>
> - Each ttm_resource_manager maintains a list of evicted buffer objects.
>
> - ttm_mem_unevict walks the list of evicted bos for a given
>   ttm_resource_manager and tries moving evicted resources back. When a
>   buffer is freed, this function is called to immediately restore some
>   evicted memory (roughly sketched below).
>
> - Each ttm_buffer_object independently tracks the mem_type it wants
>   to reside in.
>
> - ttm_bo_try_unevict is added as a helper function which attempts to
>   move the buffer to its preferred mem_type. If no space is available
>   there, it fails with -ENOSPC/-ENOMEM.
>
> - Similar to how ttm_bo_evict works, each driver can implement
>   uneviction_valuable/unevict_flags callbacks to control buffer
>   un-eviction.
>
> This is what patches 1-10 accomplish (together with an amdgpu
> implementation utilizing the new API).
>
> Userspace priorities could then be implemented as follows:
>
> - TTM already manages priorities for each buffer object. These
>   priorities can be updated by userspace via a GEM_OP ioctl to inform
>   the kernel which buffers should be evicted before others. If an ioctl
>   increases the priority of a buffer, ttm_bo_try_unevict is called on
>   that buffer to try and move it back (potentially evicting buffers
>   with a lower priority).
>
> - Buffers should never be evicted by other buffers with equal/lower
>   priority, but if there is a buffer with lower priority occupying
>   VRAM, it should be evicted in favor of the higher-priority one. This
>   prevents ping-ponging between buffers that try evicting each other
>   and is trivially implementable with an early-exit in
>   ttm_mem_evict_first (also sketched below).
>
> This is covered in patches 11-15, with the new features exposed to
> userspace in patches 16-18.
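>
> Roughly, the un-eviction walk looks like this. Field names like
> man->evicted and bo->evicted_link are placeholders, and locking and
> refcounting are elided; the patches themselves are the authoritative
> version:
>
>     /* Walk the manager's evicted list and try moving BOs back to
>      * their preferred mem_type, e.g. after a buffer was freed. */
>     void ttm_mem_unevict(struct ttm_resource_manager *man)
>     {
>         struct ttm_buffer_object *bo, *tmp;
>
>         list_for_each_entry_safe(bo, tmp, &man->evicted, evicted_link) {
>             int ret = ttm_bo_try_unevict(bo);
>
>             /* -ENOSPC/-ENOMEM just means the preferred mem_type is
>              * still full; stop walking and retry on the next free. */
>             if (ret == -ENOSPC || ret == -ENOMEM)
>                 break;
>         }
>     }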
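>
> And the early-exit itself could look roughly like this: the eviction
> walk only considers LRU levels strictly below the priority of the BO
> being placed, so equal/higher-priority buffers are never evicted on its
> behalf. Function name and signature are illustrative only, with the
> existing candidate checks of ttm_mem_evict_first elided:
>
>     static int ttm_mem_evict_first_below(struct ttm_resource_manager *man,
>                                          unsigned int evicting_prio)
>     {
>         struct ttm_resource *res;
>         unsigned int prio;
>
>         for (prio = 0; prio < evicting_prio; ++prio) {
>             list_for_each_entry(res, &man->lru[prio], lru) {
>                 /* ... existing eviction-candidate checks ... */
>                 /* ... evict res->bo and return 0 on success ... */
>             }
>         }
>
>         /* Everything resident has equal or higher priority: report
>          * failure instead of ping-ponging. */
>         return -ENOSPC;
>     }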
>
> I also have a RADV branch utilizing this API at [2], which I use for
> testing.
>
> This implementation is still very much WIP, although the D3D12 games I
> tested already seemed to benefit from it. Nevertheless, there are still
> quite a few TODOs and unresolved questions/problems.
>
> Some kernel drivers (e.g. i915) already use TTM priorities for
> kernel-internal purposes. Of course, some of the highest priorities
> should stay reserved for these purposes (with userspace being able to
> use the lower priorities).
>
> Another problem with priorities is the possibility of apps starving
> other apps by occupying all of VRAM with high-priority allocations. A
> possible solution could be to restrict the highest priority/priorities
> to important apps like compositors.
>
> Tying into this problem, only apps that actively cooperate to reduce
> memory pressure can benefit from the current memory priority
> implementation. Eventually, the priority system could also be utilized
> to benefit all applications, for example with the desktop environment
> boosting the priority of the currently focused app/its cgroup (to
> provide the best QoS to the apps the user is actively using). A full
> implementation of this is probably out of scope for this initial
> proposal, but it's worth considering as a possible future use of the
> priority API.
>
> I'm primarily looking to integrate this into amdgpu to solve the
> issues I've seen there, but I'm also interested in feedback from
> other drivers. Is this something you'd be interested in? Do you
> have any objections/comments/questions about my proposed design?
>
> Thanks,
> Friedrich
>
> [1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6833
> [2] https://gitlab.freedesktop.org/pixelcluster/mesa/-/tree/spilling
>
> Friedrich Vock (18):
>   drm/ttm: Add tracking for evicted memory
>   drm/ttm: Add per-BO eviction tracking
>   drm/ttm: Implement BO eviction tracking
>   drm/ttm: Add driver funcs for uneviction control
>   drm/ttm: Add option to evict no BOs in operation
>   drm/ttm: Add public buffer eviction/uneviction functions
>   drm/amdgpu: Add TTM uneviction control functions
>   drm/amdgpu: Don't try moving BOs to preferred domain before submit
>   drm/amdgpu: Don't mark VRAM as a busy placement for VRAM|GTT resources
>   drm/amdgpu: Don't add GTT to initial domains after failing to allocate
>     VRAM
>   drm/ttm: Bump BO priority count
>   drm/ttm: Do not evict BOs with higher priority
>   drm/ttm: Implement ttm_bo_update_priority
>   drm/ttm: Consider BOs placed in non-favorite locations evicted
>   drm/amdgpu: Set a default priority for user/kernel BOs
>   drm/amdgpu: Implement SET_PRIORITY GEM op
>   drm/amdgpu: Implement EVICTED_VRAM query
>   drm/amdgpu: Bump minor version
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 191 +---------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h     |   4 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  25 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   3 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  26 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |   4 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  50 ++++
>  drivers/gpu/drm/ttm/ttm_bo.c               | 253 ++++++++++++++++++++-
>  drivers/gpu/drm/ttm/ttm_bo_util.c          |   3 +
>  drivers/gpu/drm/ttm/ttm_device.c           |   1 +
>  drivers/gpu/drm/ttm/ttm_resource.c         |  19 +-
>  include/drm/ttm/ttm_bo.h                   |  22 ++
>  include/drm/ttm/ttm_device.h               |  28 +++
>  include/drm/ttm/ttm_resource.h             |  11 +-
>  include/uapi/drm/amdgpu_drm.h              |   3 +
>  17 files changed, 430 insertions(+), 218 deletions(-)
>
> --
> 2.44.0
>