Hey Guys,

One particular piece I'd like to discuss is how to get around the below issue:

> Known current obstacles:
> ------------------------
>
> The SQ is currently programmed to disregard the HQD priorities, and instead it picks
> jobs at random. Settings from the shader itself are also disregarded as this is
> considered a privileged field.
>
> Effectively we can get our compute wavefront launched ASAP, but we might not get the
> time we need on the SQ.
>
> The current programming would have to be changed to allow priority propagation
> from the HQD into the SQ.

1) Is this still an issue if we do the CU reservation that Serguei mentioned?
2) If the SQ respected the HQD priorities, would we still need the CU reservation?
3) Would updating the golden register settings be sufficient to change this behavior?
   Or would we also need a FW change?

Regards,
Andres

________________________________________
From: Andres Rodriguez
Sent: Friday, December 16, 2016 6:15 PM
To: amd-gfx at lists.freedesktop.org
Subject: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Everyone,

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249

We are interested in feedback for a mechanism to effectively schedule high
priority VR reprojection tasks (also referred to as time-warping) for Polaris10
running on the amdgpu kernel driver.

Brief context:
--------------

The main objective of reprojection is to avoid motion sickness for VR users in
scenarios where the game or application would fail to finish rendering a new
frame in time for the next VBLANK. When this happens, the user's head movements
are not reflected on the Head Mounted Display (HMD) for the duration of an
extra frame. This extended mismatch between the inner ear and the eyes may
cause the user to experience motion sickness.

The VR compositor deals with this problem by fabricating a new frame using the
user's updated head position in combination with the previous frames. This
avoids a prolonged mismatch between the HMD output and the inner ear.

Because of the adverse effects on the user, we require high confidence that the
reprojection task will complete before the VBLANK interval, even if the GFX pipe
is currently full of work from the game/application (which is most likely the
case).

For more details and illustrations, please refer to the following document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Requirements:
-------------

The mechanism must expose the following functionality:

    * Job round trip time must be predictable, from submission to fence signal

    * The mechanism must support compute workloads.

Goals:
------

    * The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy hardware should
have a latency equivalent to submitting a NOP on idle hardware.

Nice to have:
-------------

    * The mechanism should also support GFX workloads.

My understanding is that with the current hardware capabilities in Polaris10 we
will not be able to provide a solution compatible with GFX workloads.

But I would love to hear otherwise. So if anyone has an idea, approach or
suggestion that will also be compatible with the GFX ring, please let us know
about it.

    * The above guarantees should also be respected by amdkfd workloads

Would be good to have for consistency, but not strictly necessary as users
running games are not traditionally running HPC workloads in the background.

Proposed approach:
------------------

Similar to the Windows driver, we could expose a high priority compute queue to
userspace.

Submissions to this compute queue will be scheduled with high priority, and may
acquire hardware resources previously in use by other queues.

This can be achieved by taking advantage of the 'priority' field in the HQDs
and could be programmed by amdgpu or the amdgpu scheduler. The relevant
register fields are:

    * mmCP_HQD_PIPE_PRIORITY
    * mmCP_HQD_QUEUE_PRIORITY

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from pipe0. We can
statically partition these as follows:

    * 7x regular
    * 1x high priority

The relevant priorities can be set so that submissions to the high priority
ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high priority rings if the
context is marked as high priority. A corresponding priority should be added
to keep track of this information:

    * AMD_SCHED_PRIORITY_KERNEL
    * -> AMD_SCHED_PRIORITY_HIGH
    * AMD_SCHED_PRIORITY_NORMAL

The user will request a high priority context by setting an appropriate flag
in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163

The setting is at a per-context level so that we can:

    * Maintain a consistent FIFO ordering of all submissions to a context
    * Create high priority and non-high priority contexts in the same process

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities at
amdgpu_init() time, the SW scheduler will reprogram the queue priorities
dynamically when scheduling a task.

This would involve having a hardware specific callback from the scheduler to
set the appropriate queue priority: set_priority(int ring, int index, int priority)

During this callback we would have to grab the SRBM mutex to perform the
appropriate HW programming, and I'm not really sure if that is something we
should be doing from the scheduler.
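
To make the locking concern concrete, here is a minimal userspace sketch of
what such a callback could look like. It is only a model under stated
assumptions, not driver code: every sketch_* name is hypothetical, the SRBM
mutex is modeled as a plain flag, the HQD registers are modeled as struct
fields, the priority values are placeholders rather than real register
encodings, and the ring/index pair from the signature above is collapsed into
a single queue number.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_COMPUTE_QUEUES 8 /* the RFC mentions 8 compute queues on pipe0 */

/* Hypothetical priority levels; placeholder values, not register encodings. */
enum sketch_priority {
    SKETCH_PRIORITY_NORMAL = 0,
    SKETCH_PRIORITY_HIGH = 1,
};

/* Stand-in for per-queue HQD state. Real code would program the
 * mmCP_HQD_PIPE_PRIORITY / mmCP_HQD_QUEUE_PRIORITY fields instead. */
struct sketch_hqd {
    int pipe_priority;
    int queue_priority;
};

struct sketch_device {
    bool srbm_locked;       /* models the SRBM mutex as a simple flag */
    unsigned int hw_writes; /* counts simulated reprogramming passes */
    struct sketch_hqd hqd[NUM_COMPUTE_QUEUES];
};

static void sketch_srbm_lock(struct sketch_device *dev)
{
    assert(!dev->srbm_locked); /* no nested locking in this model */
    dev->srbm_locked = true;
}

static void sketch_srbm_unlock(struct sketch_device *dev)
{
    assert(dev->srbm_locked);
    dev->srbm_locked = false;
}

/* Returns 0 on success, -1 if the queue number or priority is out of range.
 * Assumes the scheduler is the only writer of the cached priority. */
static int sketch_set_queue_priority(struct sketch_device *dev, int queue,
                                     int priority)
{
    if (queue < 0 || queue >= NUM_COMPUTE_QUEUES)
        return -1;
    if (priority < SKETCH_PRIORITY_NORMAL || priority > SKETCH_PRIORITY_HIGH)
        return -1;

    /* Skip the lock entirely when the queue already has this priority. */
    if (dev->hqd[queue].queue_priority == priority)
        return 0;

    sketch_srbm_lock(dev);
    dev->hqd[queue].pipe_priority = priority;
    dev->hqd[queue].queue_priority = priority;
    dev->hw_writes++;
    sketch_srbm_unlock(dev);
    return 0;
}
```

The early return is one possible way to limit how often the scheduler path
would take the lock: back-to-back jobs of the same priority on a queue would
not reprogram anything. Whether that is enough to make locking from the
scheduler acceptable is still an open question.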

On the positive side, this approach would allow us to program a range of
priorities for jobs instead of a single "high priority" value, achieving
something similar to the niceness API available for CPU scheduling.

I'm not sure if this flexibility is something that we would need for our use
case, but it might be useful in other scenarios (multiple users sharing compute
time on a server).

This approach would require a new int field in drm_amdgpu_ctx_in, or
repurposing of the flags field.

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD priorities, and instead it
picks jobs at random. Settings from the shader itself are also disregarded as
this is considered a privileged field.

Effectively we can get our compute wavefront launched ASAP, but we might not
get the time we need on the SQ.

The current programming would have to be changed to allow priority propagation
from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be enabled for all HW
IPs with support of the SW scheduler. This will function similarly to the
current AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of
anything not committed to the HW queue.

The benefits of requesting a high priority context for a non-compute queue will
be lesser (e.g. up to 10s of wait time if a GFX command is stuck in front of
you), but having the API in place will allow us to easily improve the
implementation in the future as new features become available in new hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the implementation.

Also, once the interface is mostly decided, we can start thinking about
exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above.
Our goal is to obtain a mechanism that will allow us to complete the
reprojection job within a predictable amount of time. So if anyone has any
suggestions for improvements or alternative strategies we are more than happy
to hear them.

If any of the technical information above is incorrect, feel free to point
out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres