Hi everyone,

Recently I've been looking into remedies for apps (in particular, newer games) that experience significant performance loss when they start to hit VRAM limits, especially on older or lower-end cards that struggle to fit both desktop apps and all the game data into VRAM at once.

The root of the problem is that, from userspace's point of view, buffer eviction is very opaque: userspace applications/drivers cannot tell how oversubscribed VRAM is, nor do they have fine-grained control over which buffers get evicted. At the same time, with GPU APIs becoming increasingly lower-level and GPU-driven, only the application itself can know which buffers are used within a particular submission, and how important each buffer is.

To bridge this gap, GPU APIs include interfaces to query oversubscription and to specify memory priorities: in Vulkan, oversubscription can be queried through the VK_EXT_memory_budget extension, and buffers can be assigned priorities via the VK_EXT_pageable_device_local_memory extension. Modern games, especially D3D12 games running via vkd3d-proton, rely on oversubscription being reported and priorities being respected in order to perform their own memory management.

However, relaying this information to the kernel via the current KMD uAPIs is not possible. On AMDGPU, for example, every work submission includes a "bo list" that contains any buffer object accessed during the course of the submission. If VRAM is oversubscribed and a buffer in the list was evicted to system memory, that buffer is moved back to VRAM (potentially evicting other, unused buffers). Since the usermode driver doesn't know which buffers are actually used by the application, its only choice is to submit a bo list that contains every buffer the application has allocated. In case of VRAM oversubscription, it is highly likely that some of the application's buffers were evicted, which almost guarantees that some buffers will get moved around.
Since the bo list is only known at submit time, this also means the buffers will get moved right before submitting application work, which is the worst possible time to move buffers from a latency perspective. Another consequence of the large bo list is that nearly all memory from other applications will be evicted, too. When different applications (e.g. game and compositor) submit work one after the other, this causes a ping-pong effect where each app's submission evicts the other app's memory, resulting in a large amount of unnecessary moves.

This overly aggressive eviction behavior led to RADV adopting a change that effectively allows all VRAM allocations to reside in system memory [1]. This worked around the ping-ponging/excessive-buffer-moving problem, but it also meant that any memory evicted to system memory would stay there forever, regardless of how VRAM is used afterwards.

My proposal aims to provide a middle ground between these extremes. The goals I want to meet are:
- Userspace is accurately informed about VRAM oversubscription/how much VRAM has been evicted
- Buffer eviction respects priorities set by userspace
- Wasteful ping-ponging is avoided to the extent possible

I have been testing out some prototypes, and came up with this rough sketch of an API:
- For each ttm_resource_manager, the amount of evicted memory is tracked (similarly to how "usage" tracks the memory usage). When memory is evicted via ttm_bo_evict, the size of the evicted memory is added; when memory is un-evicted (see below), its size is subtracted. The amount of evicted memory for e.g. VRAM can be queried by userspace via an ioctl.
- Each ttm_resource_manager maintains a list of evicted buffer objects.
- ttm_mem_unevict walks the list of evicted BOs for a given ttm_resource_manager and tries moving evicted resources back. When a buffer is freed, this function is called to immediately restore some evicted memory.
- Each ttm_buffer_object independently tracks the mem_type it wants to reside in.
- ttm_bo_try_unevict is added as a helper function which attempts to move a buffer to its preferred mem_type. If no space is available there, it fails with -ENOSPC/-ENOMEM.
- Similarly to how ttm_bo_evict works, each driver can implement uneviction_valuable/unevict_flags callbacks to control buffer un-eviction.

This is what patches 1-10 accomplish (together with an amdgpu implementation utilizing the new API).

Userspace priorities could then be implemented as follows:
- TTM already manages priorities for each buffer object. These priorities can be updated by userspace via a GEM_OP ioctl to inform the kernel which buffers should be evicted before others. If an ioctl increases the priority of a buffer, ttm_bo_try_unevict is called on that buffer to try and move it back (potentially evicting buffers with a lower priority).
- Buffers should never be evicted by other buffers with equal or lower priority, but if a buffer with lower priority occupies VRAM, it should be evicted in favor of the higher-priority one. This prevents ping-ponging between buffers that try evicting each other, and is trivially implementable with an early exit in ttm_mem_evict_first.

This is covered in patches 11-15, with the new features exposed to userspace in patches 16-18. I also have a RADV branch utilizing this API at [2], which I use for testing. That implementation is still very much WIP, although the D3D12 games I tested already seemed to benefit from it. Nevertheless, there are still quite a few TODOs and unresolved questions/problems.

Some kernel drivers (e.g. i915) already use TTM priorities for kernel-internal purposes. Of course, some of the highest priorities should stay reserved for these purposes (with userspace being able to use the lower priorities). Another problem with priorities is the possibility of apps starving other apps by occupying all of VRAM with high-priority allocations.
A possible solution could be to restrict the highest priority/priorities to important apps like compositors. Tying into this problem, only apps that actively cooperate to reduce memory pressure can benefit from the current memory priority implementation. Eventually, the priority system could also be utilized to benefit all applications, for example with the desktop environment boosting the priority of the currently focused app/its cgroup (to provide the best QoS to the apps the user is actively using). A full implementation of this is probably out of scope for this initial proposal, but it's probably a good idea to consider it as a possible future use of the priority API.

I'm primarily looking to integrate this into amdgpu to solve the issues I've seen there, but I'm also interested in feedback from other drivers: Is this something you'd be interested in? Do you have any objections/comments/questions about my proposed design?

Thanks,
Friedrich

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6833
[2] https://gitlab.freedesktop.org/pixelcluster/mesa/-/tree/spilling

Friedrich Vock (18):
  drm/ttm: Add tracking for evicted memory
  drm/ttm: Add per-BO eviction tracking
  drm/ttm: Implement BO eviction tracking
  drm/ttm: Add driver funcs for uneviction control
  drm/ttm: Add option to evict no BOs in operation
  drm/ttm: Add public buffer eviction/uneviction functions
  drm/amdgpu: Add TTM uneviction control functions
  drm/amdgpu: Don't try moving BOs to preferred domain before submit
  drm/amdgpu: Don't mark VRAM as a busy placement for VRAM|GTT resources
  drm/amdgpu: Don't add GTT to initial domains after failing to allocate VRAM
  drm/ttm: Bump BO priority count
  drm/ttm: Do not evict BOs with higher priority
  drm/ttm: Implement ttm_bo_update_priority
  drm/ttm: Consider BOs placed in non-favorite locations evicted
  drm/amdgpu: Set a default priority for user/kernel BOs
  drm/amdgpu: Implement SET_PRIORITY GEM op
  drm/amdgpu: Implement EVICTED_VRAM query
  drm/amdgpu: Bump minor version

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 191 +---------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h     |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  25 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  26 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  50 ++++
 drivers/gpu/drm/ttm/ttm_bo.c               | 253 ++++++++++++++++++++-
 drivers/gpu/drm/ttm/ttm_bo_util.c          |   3 +
 drivers/gpu/drm/ttm/ttm_device.c           |   1 +
 drivers/gpu/drm/ttm/ttm_resource.c         |  19 +-
 include/drm/ttm/ttm_bo.h                   |  22 ++
 include/drm/ttm/ttm_device.h               |  28 +++
 include/drm/ttm/ttm_resource.h             |  11 +-
 include/uapi/drm/amdgpu_drm.h              |   3 +
 17 files changed, 430 insertions(+), 218 deletions(-)

--
2.44.0