Hi Christian,

That is definitely a concern. What we are currently thinking is to make
the high priority queues accessible to root only.

Therefore, if a non-root user attempts to set the high priority flag on
context allocation, we would fail the call and return EPERM.

Regards,
Andres
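As a rough sketch, that gate could look like the following at
context-allocation time; the flag and helper names here are hypothetical,
for illustration only:

    #include <linux/capability.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    #define AMDGPU_CTX_ALLOC_HIGH_PRIORITY (1 << 0) /* hypothetical flag */

    /* Hypothetical check at context allocation time: only processes
     * with CAP_SYS_NICE (e.g. a root VR compositor) may request a
     * high priority context; everyone else gets -EPERM. */
    static int amdgpu_ctx_priority_permit(u32 flags)
    {
            if ((flags & AMDGPU_CTX_ALLOC_HIGH_PRIORITY) &&
                !capable(CAP_SYS_NICE))
                    return -EPERM;
            return 0;
    }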
On 12/20/2016 7:56 AM, Christian König wrote:
>> BTW: If there is a non-VR application which will use the high-priority
>> h/w queue then the VR application will suffer. Any ideas how
>> to solve it?
> Yeah, that problem came to my mind as well.
>
> Basically we need to restrict those high priority submissions to the
> VR compositor or otherwise any malfunctioning application could use it.
>
> Just think about some WebGL suddenly taking all our rendering away and
> we won't get anything drawn any more.
>
> Alex or Michel, any ideas on that?
>
> Regards,
> Christian.
>
> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>> > If the compute queue is occupied only by you, the efficiency
>> > is equal to setting the job queue to high priority, I think.
>> The only risk is the situation where graphics takes all the
>> needed CUs. But in any case it should be a very good test.
>>
>> Andres/Pierre-Loup,
>>
>> Did you try to do it, or is it a lot of work for you?
>>
>>
>> BTW: If there is a non-VR application which will use the high-priority
>> h/w queue then the VR application will suffer. Any ideas how
>> to solve it?
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>> Do you encounter the priority issue for the compute queue with the
>>> current driver?
>>>
>>> If the compute queue is occupied only by you, the efficiency is equal
>>> to setting the job queue to high priority, I think.
>>>
>>> Regards,
>>> David Zhou
>>>
>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>> Yes, Vulkan is available on all-open through the mesa radv UMD.
>>>>
>>>> I'm not sure if I'm asking for too much, but if we can coordinate a
>>>> similar interface in radv and amdgpu-pro at the Vulkan level that
>>>> would be great.
>>>>
>>>> I'm not sure what that's going to be yet.
>>>>
>>>> - Andres
>>>>
>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>
>>>>>
>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>> We're currently working with the open stack; I assume that a
>>>>>> mechanism could be exposed by both open and Pro Vulkan userspace
>>>>>> drivers and that the amdgpu kernel interface improvements we
>>>>>> would pursue following this discussion would let both drivers
>>>>>> take advantage of the feature, correct?
>>>>> Of course.
>>>>> Does the open stack have Vulkan support?
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>>
>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>> By the way, are you using the all-open driver or the amdgpu-pro
>>>>>>> driver?
>>>>>>>
>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>
>>>>>>> Regards,
>>>>>>> David Zhou
>>>>>>>
>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>> Hi Serguei,
>>>>>>>>
>>>>>>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>>>>>>> see replies inline.
>>>>>>>>
>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>> Andres,
>>>>>>>>>
>>>>>>>>>> For current VR workloads we actually have 3 separate processes
>>>>>>>>>> running:
>>>>>>>>> So we could have a potential memory overcommit case, or do you do
>>>>>>>>> partitioning on your own? I would think that there is a need to
>>>>>>>>> avoid overcommit in the VR case to prevent any BO migration.
>>>>>>>>
>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>>>>>> prioritized GPU scheduling and pre-emption (e.g. this thread), and
>>>>>>>> in the future it will make sense to do work in order to make sure
>>>>>>>> that its memory allocations do not get evicted, to prevent any
>>>>>>>> unwelcome additional latency in the event of needing to perform
>>>>>>>> just-in-time reprojection.
>>>>>>>>
>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>> Based on my understanding sharing BOs between different processes
>>>>>>>>> could introduce additional synchronization constraints. BTW: I am
>>>>>>>>> not sure if we are able to share Vulkan sync. objects across the
>>>>>>>>> process boundary.
>>>>>>>>
>>>>>>>> They are different processes; it is important for the compositor
>>>>>>>> that is responsible for quality-of-service features such as
>>>>>>>> consistently presenting distorted frames with the right latency,
>>>>>>>> reprojection, etc, to be separate from the main application.
>>>>>>>>
>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>> semaphore extensions to fetch updated eye images from the client
>>>>>>>> application, but the just-in-time reprojection discussed here does
>>>>>>>> not actually have any direct interactions with cross-process
>>>>>>>> resource sharing, since it's achieved by using the latest, most
>>>>>>>> up-to-date eye images that have already been sent by the client
>>>>>>>> application, which are already available to use without additional
>>>>>>>> synchronization.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 3) System compositor (we are looking at approaches to remove
>>>>>>>>>> this overhead)
>>>>>>>>> Yes, IMHO the best is to run in "full screen mode".
>>>>>>>>
>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>> headset display without any intermediaries as a separate effort.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The latency is our main concern,
>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>> compute usage). It looks like amdgpu / kernel submission is
>>>>>>>>> rather CPU intensive (at least in the default configuration).
>>>>>>>>
>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>> However, if there's a high degree of variance then that would be
>>>>>>>> troublesome and we would need to account for the worst case.
>>>>>>>>
>>>>>>>> Hopefully the requirements and approach we described make sense;
>>>>>>>> we're looking forward to your feedback and suggestions.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> - Pierre-Loup
>>>>>>>>
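As background, prioritized CPU scheduling of the kind Pierre-Loup mentions
is typically requested through the standard sched_setscheduler() API; a
minimal userspace sketch (not Valve's actual compositor code):

    #include <sched.h>
    #include <stdio.h>

    /* Ask the kernel to run the calling thread under the SCHED_FIFO
     * real-time policy so it preempts normal (SCHED_OTHER) threads.
     * Requires CAP_SYS_NICE or an appropriate RLIMIT_RTPRIO. */
    int make_thread_realtime(int rt_priority)
    {
            struct sched_param param = { .sched_priority = rt_priority };

            if (sched_setscheduler(0 /* calling thread */, SCHED_FIFO,
                                   &param) != 0) {
                    perror("sched_setscheduler");
                    return -1;
            }
            return 0;
    }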
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in
>>>>>>>>> amdgpu
>>>>>>>>>
>>>>>>>>> Hey Serguei,
>>>>>>>>>
>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>>>>>>> understand (simplifying), some scheduling is per pipe. I know
>>>>>>>>>> about the current allocation scheme but I do not think that it
>>>>>>>>>> is ideal. I would assume that we need to switch to dynamic
>>>>>>>>>> partitioning of resources based on the workload, otherwise we
>>>>>>>>>> will have a resource conflict between Vulkan compute and OpenCL.
>>>>>>>>>
>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can start
>>>>>>>>> with a solution that assumes that only pipe0 has any work and the
>>>>>>>>> other pipes are idle (no HSA/ROCm running on the system).
>>>>>>>>>
>>>>>>>>> This should be more or less the use case we expect from VR users.
>>>>>>>>>
>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>> consider that a separate task, because making it dynamic is not
>>>>>>>>> straightforward :P
>>>>>>>>>
>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>>>>> will not be involved. I would assume that in the case of VR we
>>>>>>>>>> will have one main application ("console" mode(?)) so we could
>>>>>>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>>>>>>
>>>>>>>>> Correct, this is why we want to enable the high priority compute
>>>>>>>>> queue through libdrm-amdgpu, so that we can expose it through
>>>>>>>>> Vulkan later.
>>>>>>>>>
>>>>>>>>> For current VR workloads we actually have 3 separate processes
>>>>>>>>> running:
>>>>>>>>> 1) Game process
>>>>>>>>> 2) VR Compositor (this is the process that will require the high
>>>>>>>>> priority queue)
>>>>>>>>> 3) System compositor (we are looking at approaches to remove this
>>>>>>>>> overhead)
>>>>>>>>>
>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>> simultaneously, but I would also like to be able to address this
>>>>>>>>> case in the future (cross-pipe priorities).
>>>>>>>>>
>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>>>> (a) it may take time, so latency may suffer
>>>>>>>>>
>>>>>>>>> The latency is our main concern; we want something that is
>>>>>>>>> predictable. A good illustration of what the reprojection
>>>>>>>>> scheduling looks like can be found here:
>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>
>>>>>>>>>> (b) to preempt we need to have a different "context" - we want
>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>> executed in order.
>>>>>>>>>
>>>>>>>>> This is okay, as the reprojection work doesn't have dependencies
>>>>>>>>> on the game context, and it even happens in a separate process.
>>>>>>>>>
>>>>>>>>>> BTW: (a) Do you want to "preempt" and later resume, or do you
>>>>>>>>>> want to "preempt" and "cancel/abort"?
>>>>>>>>>
>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>
>>>>>>>>>> (b) Vulkan is a generic API and could be used for graphics as
>>>>>>>>>> well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>
>>>>>>>>> Yeah, the plan is to use Vulkan compute. But if you figure out a
>>>>>>>>> way for us to get a guaranteed execution time using Vulkan
>>>>>>>>> graphics, then I'll take you out for a beer :)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Andres
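For illustration, one shape the libdrm-amdgpu exposure mentioned above
could take is a priority-aware variant of amdgpu_cs_ctx_create(); the
entry point and constant below are hypothetical at the time of this RFC,
sketched only to show where such an interface could land:

    #include <amdgpu.h>
    #include <amdgpu_drm.h>

    /* Hypothetical: create a context whose compute submissions land on
     * the high priority ring. The kernel would reject the request
     * (e.g. with -EPERM) for callers without sufficient privileges. */
    int create_hp_context(amdgpu_device_handle dev,
                          amdgpu_context_handle *ctx)
    {
            return amdgpu_cs_ctx_create2(dev, AMDGPU_CTX_PRIORITY_HIGH,
                                         ctx);
    }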
>>>>>>>>> ________________________________________
>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in
>>>>>>>>> amdgpu
>>>>>>>>>
>>>>>>>>> Hi Andres,
>>>>>>>>>
>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in
>>>>>>>>> amdgpu
>>>>>>>>>
>>>>>>>>> Hi Serguei,
>>>>>>>>>
>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Andres
>>>>>>>>>
>>>>>>>>> ________________________________________
>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in
>>>>>>>>> amdgpu
>>>>>>>>>
>>>>>>>>> Andres,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Quick comments:
>>>>>>>>>
>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>> assignments/binding to the high-priority queue when it will be in
>>>>>>>>> use and "free" them later (we do not want to take CUs away from
>>>>>>>>> e.g. a graphics task forever and degrade graphics performance).
>>>>>>>>>
>>>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>>>> low-priority compute) takes all (extra) CUs and high-priority
>>>>>>>>> work waits for the needed resources.
>>>>>>>>> It will not be visible with "NOP" packets but only when you
>>>>>>>>> submit a "real" compute task, so I would recommend not using
>>>>>>>>> "NOP" packets at all for testing.
>>>>>>>>>
>>>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>>>> everything is going via the kernel (e.g. as part of frame
>>>>>>>>> submission), but I must admit that I am not sure about the best
>>>>>>>>> way for user level submissions (amdkfd).
>>>>>>>>>
>>>>>>>>> [AR] I wasn't aware of this part of the programming sequence.
>>>>>>>>> Thanks for the heads up!
>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler",
>>>>>>>>> when deciding which queue to run, will check if there are enough
>>>>>>>>> resources and if not then it will begin to check other queues
>>>>>>>>> with lower priority.
>>>>>>>>>
>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>> high-priority queue and having nothing there except it.
>>>>>>>>>
>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as
>>>>>>>>> opposed to the MEC definition of pipe, which is a grouping of
>>>>>>>>> queues). I say this because amdgpu only has access to 1 pipe,
>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>
>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>>>>>> understand (simplifying), some scheduling is per pipe. I know
>>>>>>>>> about the current allocation scheme but I do not think that it is
>>>>>>>>> ideal. I would assume that we need to switch to dynamic
>>>>>>>>> partitioning of resources based on the workload, otherwise we
>>>>>>>>> will have a resource conflict between Vulkan compute and OpenCL.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW: Which user level API do you want to use for compute: Vulkan
>>>>>>>>> or OpenCL?
>>>>>>>>>
>>>>>>>>> [AR] Vulkan
>>>>>>>>>
>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>>>> will not be involved. I would assume that in the case of VR we
>>>>>>>>> will have one main application ("console" mode(?)) so we could
>>>>>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>>>>>>
>>>>>>>>>> we will not be able to provide a solution compatible with GFX
>>>>>>>>>> workloads.
>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>
>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the currently
>>>>>>>>> running graphics job and scheduling in something else using
>>>>>>>>> mid-buffer pre-emption has some cases where it doesn't work well.
>>>>>>>>> But if with polaris10 it starts working well, it might be a
>>>>>>>>> better solution for us (because the whole reprojection work uses
>>>>>>>>> the Vulkan graphics stack at the moment, and porting it to
>>>>>>>>> compute is not trivial).
>>>>>>>>>
>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task: (a) it
>>>>>>>>> may take time, so latency may suffer (b) to preempt we need to
>>>>>>>>> have a different "context" - we want to guarantee that
>>>>>>>>> submissions from the same context will be executed in order.
>>>>>>>>> BTW: (a) Do you want to "preempt" and later resume, or do you
>>>>>>>>> want to "preempt" and "cancel/abort"? (b) Vulkan is a generic API
>>>>>>>>> and could be used for graphics as well as for plain compute tasks
>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>>
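Serguei's description of the queue-pick behavior above (skip a queue when
it cannot get enough resources and fall through to lower priorities) can
be modeled in a few lines; this is purely an illustrative model, not
firmware or driver code:

    #include <stddef.h>

    /* Simplified model of the queue-pick behavior described above: if a
     * high priority queue cannot get the CUs its work needs (because a
     * long-running graphics or low-priority compute task holds them),
     * the scheduler falls through to lower priority queues instead. */
    struct hw_queue {
            int priority;    /* higher value = higher priority */
            int cus_needed;  /* CUs this queue's work requires */
            int has_work;
    };

    static struct hw_queue *pick_next_queue(struct hw_queue *queues,
                                            int count, int cus_free)
    {
            struct hw_queue *best = NULL;
            int i;

            for (i = 0; i < count; i++) {
                    struct hw_queue *q = &queues[i];

                    if (!q->has_work || q->cus_needed > cus_free)
                            continue; /* skipped: not enough resources */
                    if (!best || q->priority > best->priority)
                            best = q;
            }
            return best; /* may be a lower priority queue, or NULL */
    }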
>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>> behalf of Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>
>>>>>>>>> We are interested in feedback for a mechanism to effectively
>>>>>>>>> schedule high priority VR reprojection tasks (also referred to as
>>>>>>>>> time-warping) for Polaris10 running on the amdgpu kernel driver.
>>>>>>>>>
>>>>>>>>> Brief context:
>>>>>>>>> --------------
>>>>>>>>>
>>>>>>>>> The main objective of reprojection is to avoid motion sickness
>>>>>>>>> for VR users in scenarios where the game or application would
>>>>>>>>> fail to finish rendering a new frame in time for the next VBLANK.
>>>>>>>>> When this happens, the user's head movements are not reflected on
>>>>>>>>> the Head Mounted Display (HMD) for the duration of an extra
>>>>>>>>> frame. This extended mismatch between the inner ear and the eyes
>>>>>>>>> may cause the user to experience motion sickness.
>>>>>>>>>
>>>>>>>>> The VR compositor deals with this problem by fabricating a new
>>>>>>>>> frame using the user's updated head position in combination with
>>>>>>>>> the previous frames. This avoids a prolonged mismatch between the
>>>>>>>>> HMD output and the inner ear.
>>>>>>>>>
>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>> confidence that the reprojection task will complete before the
>>>>>>>>> VBLANK interval, even if the GFX pipe is currently full of work
>>>>>>>>> from the game/application (which is most likely the case).
>>>>>>>>>
>>>>>>>>> For more details and illustrations, please refer to the following
>>>>>>>>> document:
>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>
>>>>>>>>> Requirements:
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>
>>>>>>>>> * Job round trip time must be predictable, from submission to
>>>>>>>>> fence signal
>>>>>>>>>
>>>>>>>>> * The mechanism must support compute workloads.
>>>>>>>>>
>>>>>>>>> Goals:
>>>>>>>>> ------
>>>>>>>>>
>>>>>>>>> * The mechanism should provide low submission latencies
>>>>>>>>>
>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>> hardware should be equivalent to submitting a NOP on idle
>>>>>>>>> hardware.
>>>>>>>>>
>>>>>>>>> Nice to have:
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> * The mechanism should also support GFX workloads.
>>>>>>>>>
>>>>>>>>> My understanding is that with the current hardware capabilities
>>>>>>>>> in Polaris10 we will not be able to provide a solution compatible
>>>>>>>>> with GFX workloads.
>>>>>>>>>
>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>> approach or suggestion that will also be compatible with the GFX
>>>>>>>>> ring, please let us know about it.
>>>>>>>>>
>>>>>>>>> * The above guarantees should also be respected by amdkfd
>>>>>>>>> workloads
>>>>>>>>>
>>>>>>>>> Would be good to have for consistency, but not strictly necessary
>>>>>>>>> as users running games are not traditionally running HPC
>>>>>>>>> workloads in the background.
>>>>>>>>>
>>>>>>>>> Proposed approach:
>>>>>>>>> ------------------
>>>>>>>>>
>>>>>>>>> Similar to the Windows driver, we could expose a high priority
>>>>>>>>> compute queue to userspace.
>>>>>>>>>
>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>> priority, and may acquire hardware resources previously in use by
>>>>>>>>> other queues.
>>>>>>>>>
>>>>>>>>> This can be achieved by taking advantage of the 'priority' field
>>>>>>>>> in the HQDs, and could be programmed by amdgpu or the amdgpu
>>>>>>>>> scheduler. The relevant register fields are:
>>>>>>>>> * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>> * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>
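To make the register programming concrete, here is a sketch of how those
fields could be set for one queue, following the SRBM select pattern used
elsewhere in gfx_v8_0.c; the helper name is hypothetical, not existing
driver code:

    /* Illustrative sketch (helper name hypothetical): program the HQD
     * priority fields for one compute queue. The SRBM select must be
     * serialized via adev->srbm_mutex, as done for other HQD setup. */
    static void gfx_v8_0_set_hqd_priority(struct amdgpu_device *adev,
                                          u32 me, u32 pipe, u32 queue,
                                          u32 priority)
    {
            mutex_lock(&adev->srbm_mutex);
            vi_srbm_select(adev, me, pipe, queue, 0);

            WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
            WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);

            vi_srbm_select(adev, 0, 0, 0, 0);
            mutex_unlock(&adev->srbm_mutex);
    }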
>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> The amdgpu driver currently controls 8 compute queues from pipe0.
>>>>>>>>> We can statically partition these as follows:
>>>>>>>>> * 7x regular
>>>>>>>>> * 1x high priority
>>>>>>>>>
>>>>>>>>> The relevant priorities can be set so that submissions to the
>>>>>>>>> high priority ring will starve the other compute rings and the
>>>>>>>>> GFX ring.
>>>>>>>>>
>>>>>>>>> The amdgpu scheduler will only place jobs into the high priority
>>>>>>>>> rings if the context is marked as high priority. And a
>>>>>>>>> corresponding priority should be added to keep track of this
>>>>>>>>> information:
>>>>>>>>> * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>> * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>> * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>
>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>> appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY
>>>>>>>>> or similar):
>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>
>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>> * Maintain a consistent FIFO ordering of all submissions to a
>>>>>>>>> context
>>>>>>>>> * Create high priority and non-high priority contexts in the same
>>>>>>>>> process
>>>>>>>>>
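A minimal sketch of the two additions proposed above; the names are the
RFC's placeholders, not code that exists yet:

    /* gpu_scheduler.h: proposed new level between KERNEL and NORMAL */
    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_KERNEL = 0,
            AMD_SCHED_PRIORITY_HIGH,      /* proposed */
            AMD_SCHED_PRIORITY_NORMAL,
            AMD_SCHED_PRIORITY_MAX
    };

    /* include/uapi/drm/amdgpu_drm.h: proposed flag for the 'flags'
     * member of drm_amdgpu_ctx_in at context creation time. */
    #define AMDGPU_CTX_HIGH_PRIORITY (1 << 0)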
>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Similar to the above, but instead of programming the priorities
>>>>>>>>> at amdgpu_init() time, the SW scheduler will reprogram the queue
>>>>>>>>> priorities dynamically when scheduling a task.
>>>>>>>>>
>>>>>>>>> This would involve having a hardware specific callback from the
>>>>>>>>> scheduler to set the appropriate queue priority:
>>>>>>>>> set_priority(int ring, int index, int priority)
>>>>>>>>>
>>>>>>>>> During this callback we would have to grab the SRBM mutex to
>>>>>>>>> perform the appropriate HW programming, and I'm not really sure
>>>>>>>>> if that is something we should be doing from the scheduler.
>>>>>>>>>
>>>>>>>>> On the positive side, this approach would allow us to program a
>>>>>>>>> range of priorities for jobs instead of a single "high priority"
>>>>>>>>> value, achieving something similar to the niceness API available
>>>>>>>>> for CPU scheduling.
>>>>>>>>>
>>>>>>>>> I'm not sure if this flexibility is something that we would need
>>>>>>>>> for our use case, but it might be useful in other scenarios
>>>>>>>>> (multiple users sharing compute time on a server).
>>>>>>>>>
>>>>>>>>> This approach would require a new int field in drm_amdgpu_ctx_in,
>>>>>>>>> or repurposing of the flags field.
>>>>>>>>>
>>>>>>>>> Known current obstacles:
>>>>>>>>> ------------------------
>>>>>>>>>
>>>>>>>>> The SQ is currently programmed to disregard the HQD priorities,
>>>>>>>>> and instead it picks jobs at random. Settings from the shader
>>>>>>>>> itself are also disregarded as this is considered a privileged
>>>>>>>>> field.
>>>>>>>>>
>>>>>>>>> Effectively we can get our compute wavefront launched ASAP, but
>>>>>>>>> we might not get the time we need on the SQ.
>>>>>>>>>
>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>> priority propagation from the HQD into the SQ.
>>>>>>>>>
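To make the callback's shape concrete, here is a sketch of how approach 2
could hook a set_priority() callback into the per-ring function table;
the member and the trimmed-down struct bodies below are illustrative, not
the existing amdgpu definitions:

    struct amdgpu_ring;

    /* Proposed hardware specific hook, called by the SW scheduler just
     * before it hands a job to the ring. */
    struct amdgpu_ring_funcs {
            /* ... existing members ... */
            void (*set_priority)(struct amdgpu_ring *ring, int priority);
    };

    struct amdgpu_ring {
            const struct amdgpu_ring_funcs *funcs;
            /* ... existing members ... */
    };

    static void amd_sched_set_job_priority(struct amdgpu_ring *ring,
                                           int priority)
    {
            /* The implementation would take adev->srbm_mutex internally
             * to reprogram the HQD, which is the part we are unsure
             * about doing from the scheduler. */
            if (ring->funcs->set_priority)
                    ring->funcs->set_priority(ring, priority);
    }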
>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>> --------------------------------
>>>>>>>>>
>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>> enabled for all HW IPs with support of the SW scheduler. This
>>>>>>>>> will function similarly to the current AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>> priority, where the job can jump ahead of anything not committed
>>>>>>>>> to the HW queue.
>>>>>>>>>
>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>> non-compute queue will be lesser (e.g. up to 10s of wait time if
>>>>>>>>> a GFX command is stuck in front of you), but having the API in
>>>>>>>>> place will allow us to easily improve the implementation in the
>>>>>>>>> future as new features become available in new hardware.
>>>>>>>>>
>>>>>>>>> Future steps:
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>> implementation.
>>>>>>>>>
>>>>>>>>> Also, once the interface is mostly decided, we can start thinking
>>>>>>>>> about exposing the high priority queue through radv.
>>>>>>>>>
>>>>>>>>> Request for feedback:
>>>>>>>>> ---------------------
>>>>>>>>>
>>>>>>>>> We aren't married to any of the approaches outlined above. Our
>>>>>>>>> goal is to obtain a mechanism that will allow us to complete the
>>>>>>>>> reprojection job within a predictable amount of time. So if
>>>>>>>>> anyone has any suggestions for improvements or alternative
>>>>>>>>> strategies we are more than happy to hear them.
>>>>>>>>>
>>>>>>>>> If any of the technical information above is also incorrect, feel
>>>>>>>>> free to point out my misunderstandings.
>>>>>>>>>
>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Andres
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>