In a nutshell:

This RFC proposes a control mechanism for the pinning of VRAM (GPU local
memory) that is initiated by HSA processes. The mechanism is meant to
prevent starvation of graphics applications due to high VRAM usage by
HSA processes.

TOC:
----------------------------------------------------------------
1. amdkfd's VRAM-related IOCTLs overview
2. TTM BOs migration overview
3. The why
4. Analyzing the use-cases
5. Proposed mechanism
6. Conclusion
----------------------------------------------------------------

1. amdkfd's VRAM-related IOCTLs overview:

amdkfd provides four IOCTLs for VRAM allocation & mapping (the names
below are presented just for convenience and can change until the final
implementation):

- Allocate memory on VRAM -> AMDKFD_IOC_ALLOC_MEMORY_ON_GPU
- Free memory on VRAM     -> AMDKFD_IOC_FREE_MEMORY_ON_GPU
- Map memory to GPU       -> AMDKFD_IOC_MAP_MEMORY_TO_GPU
- Unmap memory from GPU   -> AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU

An HSA process that needs to use VRAM first calls the
AMDKFD_IOC_ALLOC_MEMORY_ON_GPU IOCTL. This IOCTL allocates a list of BOs
(Buffer Objects) that together represent the amount of memory the HSA
process requested. e.g. if a single BO represents 1MB of VRAM, then
amdkfd will allocate a list of 100 BOs for an allocation request of
100MB of VRAM.

Before the memory can be used, the HSA process needs to call the
AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL. This IOCTL pins the relevant BOs
(some or all of the BOs that were created in the alloc IOCTL) and
updates the PT/PD of the GPUVM. e.g. continuing the previous example, if
the HSA process wants to dispatch a kernel that will use the last 10MB
(of the 100MB it allocated), then amdkfd will pin the last ten BOs in
the list.

After the GPU kernel has finished using the memory, the HSA process
needs to call the AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL. This IOCTL
unpins the BOs and updates the PT/PD of the GPUVM.

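To make the 100MB example concrete, here is a minimal sketch of the
arithmetic only. It assumes a fixed BO size of 1MB (the RFC uses that
figure purely as an example), and every name in it (BO_SIZE_BYTES,
bo_count_for, bos_to_pin) is hypothetical and not part of amdkfd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: every BO covers exactly 1MB, as in the example above. */
#define BO_SIZE_BYTES (1024ULL * 1024ULL)

/* Hypothetical helper type: a contiguous run of BOs in the list. */
struct bo_range {
    size_t first; /* index of the first BO to pin */
    size_t count; /* number of BOs to pin */
};

/* Number of BOs needed to back an allocation of `bytes`, rounded up. */
static size_t bo_count_for(uint64_t bytes)
{
    return (size_t)((bytes + BO_SIZE_BYTES - 1) / BO_SIZE_BYTES);
}

/*
 * BOs that overlap the byte range [offset, offset + length) of the
 * allocation. These are the BOs a map request for that range would pin.
 */
static struct bo_range bos_to_pin(uint64_t offset, uint64_t length)
{
    struct bo_range r;
    size_t last;

    assert(length > 0);
    r.first = (size_t)(offset / BO_SIZE_BYTES);
    last = (size_t)((offset + length - 1) / BO_SIZE_BYTES);
    r.count = last - r.first + 1;
    return r;
}
```

With these helpers, a 100MB allocation yields 100 BOs, and mapping the
last 10MB (offset 90MB, length 10MB) selects BOs 90 through 99.
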
If the HSA process wants to dispatch another GPU kernel that will use
the same memory, then it can call the AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL
again. After the kernel finishes, the HSA process needs to call the
AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL.

Finally, when the memory is no longer needed, the HSA process needs to
call the AMDKFD_IOC_FREE_MEMORY_ON_GPU IOCTL. This IOCTL destroys the
BOs. This action will also be performed on process tear-down.

The important point to remember is that once the HSA process calls the
AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL and amdkfd pins a list of BOs, then
from amdkfd's POV, those BOs are in use and must not be unpinned &
moved, even if they are currently idle (not used by a GPU kernel).

2. TTM BOs migration overview:

For those unfamiliar with TTM, here is a short overview regarding
migration of BOs in TTM (note, this is a simplistic overview):

Every BO has a reservation point (fence) attached to it. When the GPU
has finished working with that BO, it writes to its resv. point to
signal that the work has been done and the BO is now idle. To enable
this mechanism, the graphics driver (radeon) dispatches a fence packet
after each CS (command submission).

TTM maintains an LRU list of BOs. All the BOs are on that list,
regardless of whether they are in use or idle, pinned or unpinned.

When TTM encounters a memory pressure situation (e.g. it tries to pin a
BO on VRAM but does not have enough space), it walks over the LRU list
and tries to evict BOs that are placed in VRAM *and* are idle (meaning
that they can be migrated to GART or system memory) until it has enough
space for the new request.

How does TTM find out whether a BO is idle? It checks the BO's
reservation point. If it is signaled, then the BO is idle and can be
migrated. If not, that BO is still in use. The check is done in two
stages. First, TTM does a simple check that asks whether the fence is
signaled. This check is called in atomic context, so the device driver
can't block.
The second check is wait_until_signaled. That function can block, but
there is a timeout enforced by TTM.

What is a reservation point? It is a generic Linux kernel mechanism
that allows sharing of fences between different device drivers. In our
case, TTM assigns a reservation point to every BO. When TTM checks the
BO's reservation point, it actually calls a callback function of that
resv. point, which tells it whether the resv. point's fence has been
signaled.

The callback function is implemented by the entity using the BO, e.g.
the radeon driver. When that callback is called, radeon needs to
respond whether that BO is idle or not. radeon has that information
because it dispatches a fence packet after each CS. That way, when the
GPU kernel has finished, the GPU handles the fence packet and writes to
that fence. When radeon checks whether a BO is idle, it actually checks
whether its fence has been written to by the GPU.

Now, back to the migration process. If the BO is in use, TTM just moves
to the next BO on the LRU list. If the BO is idle, TTM migrates it to
GART or system memory to clear space for the new BO. If there is not
enough memory for the new request after passing over the entire LRU
list, TTM fails the new BO validation request.

3. The why:

HSA userspace applications sometimes need to use VRAM (GPU local
memory) for their operation. This is especially true when running on
discrete GPUs, which have high-bandwidth on-board memory.

Because current AMD GPUs don't support page faults in VRAM, the HSA
application needs to pin its allocated memory in VRAM before
dispatching the GPU kernel.

To allocate and pin the VRAM, HSA applications call amdkfd's IOCTLs,
which use the TTM subsystem to allocate and pin BOs on VRAM.

Up to this point, this is similar to a graphics application allocating
memory on VRAM through radeon.

However, in radeon, the CS is done through the driver's IOCTL.
Therefore, the radeon driver can put a fence packet after every CS to
let TTM know whether a BO is currently in use by a CS.

In contrast, in HSA the CS is done through usermode queues. Because of
that, amdkfd can *not* put a fence packet after each CS, and of course
we can't trust the userspace to do it. Therefore, the Linux kernel does
*not* have visibility into whether a BO is currently in use or not.

This creates a problem when dealing with memory pressure on a system
that runs both HSA applications and VRAM-consuming graphics
applications. When memory pressure occurs due to VRAM allocation
requests from graphics applications, the graphics CS can fail because
HSA BOs are pinned in VRAM and can't be swapped out to GART/system
memory, even if the BOs are currently idle.

In addition, there can also be a situation where an HSA-only system has
memory pressure due to fragmentation in the VRAM.

4. Analyzing the use-cases

The following describes different scenarios of system behavior
regarding VRAM usage:

- Graphics needs a buffer in a specific range (several cases for that).
  This means that *all* VRAM allocations in that range must be evicted,
  no matter what (including HSA).

- Graphics is to be prioritized over HSA (e.g. desktop computer case).
  All graphics allocations take precedence over HSA, i.e. HSA must
  always yield when TTM asks to evict BOs.

- Graphics is not important or not even existent (e.g. server). Then,
  HSA eviction can fail. However, even in this case there might still
  be a VRAM fragmentation problem that will prevent HSA pinning.

5. Proposed mechanism

The proposed mechanism is composed of two parts:

- Policy set by the system admin
- Allowing TTM to evict HSA BOs

5.a. Policy

Because we need to support the different scenarios described above, I
suggest giving the system admin the ability to select the VRAM usage
policy. This selection will dictate the behavior of amdkfd in this
regard.
The policy could be one of the following options:

- VRAM usage: prefer graphics applications
- VRAM usage: prefer HSA applications

When the first option is chosen (prefer graphics), upon *each* request
to evict a BO from VRAM, amdkfd will respond as if the BO is idle.

When the second option is chosen (prefer HSA), upon *each* request to
evict a BO from VRAM, amdkfd will respond as if the BO is in use.

Because this is a new policy that we might want to tweak in the future,
I think that it should currently be accessible only through debugfs.
Once things are mature enough and people feel confident in it, this
policy can be turned into a kernel parameter, a sysfs attribute, or
both.

The default policy, IMO, should be "prefer graphics applications".

Note that even with the policy set to "prefer graphics", we must not
evict the BOs of the PT/PD.

5.b. Eviction process

To allow TTM to evict a BO from VRAM, amdkfd effectively needs to
preempt a running usermode queue. On Carrizo we can preempt a queue
whenever we want. However, on Kaveri we could run into problems when
trying to preempt a queue.

The problems can appear when a shader takes a very long time to
complete (hundreds of ms), or in the rare case where someone wrote an
infinite shader (bug or otherwise). In those cases, Kaveri will fail to
preempt the queue, amdkfd will indicate a failure (dmesg) and the CP
will probably be stuck.

In those cases, the only option left for the driver is to perform an
operation called "kill all waves". This would terminate all the running
waves and allow the CP to preempt the queues.

In addition, the BOs that are created need to have the callback
function of their resv. point set to an amdkfd function. However, for
the BOs of the PT/PD, we need to set a different callback function so
that we can prevent the eviction of those BOs.

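As an illustration of the policy and of the two callbacks, here is a
minimal userspace sketch. It is not kernel code and does not use the
real reservation/fence API; all names (vram_policy, sketch_bo,
data_bo_is_idle, ptpd_bo_is_idle, may_evict) are hypothetical. It only
models the answer amdkfd would give when asked whether a BO is idle,
and it leaves out the queue preemption that has to happen before a BO
is reported as idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy knob, standing in for the proposed debugfs entry. */
enum vram_policy {
    VRAM_PREFER_GRAPHICS, /* proposed default */
    VRAM_PREFER_HSA
};

static enum vram_policy current_policy = VRAM_PREFER_GRAPHICS;

struct sketch_bo;
typedef bool (*idle_cb_t)(const struct sketch_bo *bo);

/* Hypothetical stand-in for a BO and its resv. point callback. */
struct sketch_bo {
    idle_cb_t is_idle;
};

/* Callback for regular HSA BOs: the answer follows the policy. */
static bool data_bo_is_idle(const struct sketch_bo *bo)
{
    (void)bo;
    return current_policy == VRAM_PREFER_GRAPHICS;
}

/* Callback for PT/PD BOs: never reported idle, so never evicted. */
static bool ptpd_bo_is_idle(const struct sketch_bo *bo)
{
    (void)bo;
    return false;
}

/* What the memory manager would conclude after asking the callback. */
static bool may_evict(const struct sketch_bo *bo)
{
    assert(bo != NULL && bo->is_idle != NULL);
    return bo->is_idle(bo);
}
```

Using a separate callback for the PT/PD BOs, rather than a flag checked
inside one shared callback, mirrors the wording above; either form
would give the same answers in this sketch.
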
The suggested algorithm for eviction (in case the policy is to prefer
graphics) is:

- TTM calls the amdkfd callback, asking whether a BO is idle
- amdkfd preempts the user space queue and removes it from the run-list
  - in case the preemption is stuck, amdkfd kills all waves
- amdkfd tells TTM that the BO is idle
- TTM evicts the buffer to GART
- amdkfd updates the GPUVM page table and does all necessary TLB
  flushing
- amdkfd restores the user space queue

6. Conclusion

The current status of the code is that the four IOCTLs mentioned in
point 1 are partially implemented. The mechanism described here is not
implemented yet, as I first wanted to get some response. So although
part of the code is ready, I would like to publish the patches as a
single patch-set.

I would like to thank RH's Jerome Glisse for helping me with this RFC.

Comments and flames are welcome.

Thanks,
	Oded