Andres, did you measure the latency (and other QoS) impact of running under __any__ compositor? My understanding is that VR has pretty strict requirements related to QoS. Sincerely yours, Serguei Sagalovitch On 2016-12-22 11:35 AM, Andres Rodriguez wrote: > Hey Christian, > > We are currently interested in X, but with some distros switching to > other compositors by default, we also need to consider those. > > We agree, running the full vrcompositor as root isn't something that > we want to do. Too many security concerns. Having a small root helper > that does the privilege escalation for us is the initial idea. > > For a long term approach, Pierre-Loup and Dave are working on dealing > with the "two compositors" scenario a little better in DRM+X. > Fullscreen isn't really a sufficient approach, since we don't want the > HMD to be used as part of the desktop environment when a VR app is not > in use (this is extremely annoying). > > When the above is settled, we should have an auth mechanism besides > DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the > HMD permanently, away from X. Re-using that auth method to gate this > IOCTL is probably going to be the final solution. > > I propose to start with ROOT_ONLY since it should allow us to respect > kernel IOCTL compatibility guidelines with the most flexibility. Going > from a restrictive to a more flexible permission model is inclusive, > but going from a general to a restrictive model may exclude some apps > that used to work. > > Regards, > Andres > > On 12/22/2016 6:42 AM, Christian König wrote: >> Hi Andres, >> >> Well, using root might cause stability and security problems as well. >> We worked quite hard to avoid exactly this for X. >> >> We could make this feature depend on the compositor being DRM master, >> but for example with X the X server is master (and e.g. can change >> resolutions etc.) and not the compositor. >> >> So another question is what windowing system (if any) are you >> planning to use? X, Wayland, Flinger or something completely different? >> >> Regards, >> Christian. >> >> On 20.12.2016 at 16:51, Andres Rodriguez wrote: >>> Hi Christian, >>> >>> That is definitely a concern. What we are currently thinking is to >>> make the high priority queues accessible to root only. >>> >>> Therefore, if a non-root user attempts to set the high priority flag >>> on context allocation, we would fail the call and return ENOPERM. >>> >>> Regards, >>> Andres >>> >>> >>> On 12/20/2016 7:56 AM, Christian König wrote: >>>>> BTW: If there is a non-VR application which uses the high-priority >>>>> h/w queue then the VR application will suffer. Any ideas how >>>>> to solve it? >>>> Yeah, that problem came to my mind as well. >>>> >>>> Basically we need to restrict those high priority submissions to >>>> the VR compositor, otherwise any malfunctioning application could >>>> use them. >>>> >>>> Just think about some WebGL app suddenly taking all our rendering away >>>> so we won't get anything drawn any more. >>>> >>>> Alex or Michel, any ideas on that? >>>> >>>> Regards, >>>> Christian. >>>> >>>> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote: >>>>> > If compute queue is occupied only by you, the efficiency >>>>> > is equal to setting the job queue to high priority, I think. >>>>> The only risk is the situation where graphics takes all the >>>>> needed CUs. But in any case it should be a very good test. >>>>> >>>>> Andres/Pierre-Loup, >>>>> >>>>> Did you try this, or would it be a lot of work for you?
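To make the ROOT_ONLY proposal concrete, here is a minimal kernel-side sketch of the permission gate being described, assuming a hypothetical AMDGPU_CTX_PRIORITY_HIGH flag bit in drm_amdgpu_ctx_in and an -EACCES return as a stand-in for the "ENOPERM" mentioned above; the real check would live wherever amdgpu handles context allocation, and none of these names are final:

    #include <linux/capability.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    /* Placeholder flag value; the real uAPI bit would be defined in
     * include/uapi/drm/amdgpu_drm.h once the interface is settled. */
    #define AMDGPU_CTX_PRIORITY_HIGH (1u << 0)

    static int amdgpu_ctx_priority_permit(u32 flags)
    {
            if (!(flags & AMDGPU_CTX_PRIORITY_HIGH))
                    return 0;       /* normal priority is always allowed */

            /* ROOT_ONLY model: only a privileged caller (e.g. the small
             * root helper doing privilege escalation for the VR
             * compositor) may request the high-priority queue. */
            if (capable(CAP_SYS_ADMIN))
                    return 0;

            return -EACCES;         /* non-root caller: fail the allocation */
    }

A later, less restrictive model (DRM master, or the HMD takeover auth mechanism mentioned above) could simply add more accepted conditions here without breaking callers that already work under ROOT_ONLY.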
>>>>> >>>>> >>>>> BTW: If there is a non-VR application which uses the high-priority >>>>> h/w queue then the VR application will suffer. Any ideas how >>>>> to solve it? >>>>> >>>>> Sincerely yours, >>>>> Serguei Sagalovitch >>>>> >>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote: >>>>>> Do you encounter the priority issue for the compute queue with the >>>>>> current driver? >>>>>> >>>>>> If the compute queue is occupied only by you, the efficiency is equal >>>>>> to setting the job queue to high priority, I think. >>>>>> >>>>>> Regards, >>>>>> David Zhou >>>>>> >>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote: >>>>>>> Yes, Vulkan is available on all-open through the mesa radv UMD. >>>>>>> >>>>>>> I'm not sure if I'm asking for too much, but if we can >>>>>>> coordinate a similar interface in radv and amdgpu-pro at the >>>>>>> Vulkan level that would be great. >>>>>>> >>>>>>> I'm not sure what that's going to be yet. >>>>>>> >>>>>>> - Andres >>>>>>> >>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote: >>>>>>>>> We're currently working with the open stack; I assume that a >>>>>>>>> mechanism could be exposed by both open and Pro Vulkan >>>>>>>>> userspace drivers and that the amdgpu kernel interface >>>>>>>>> improvements we would pursue following this discussion would >>>>>>>>> let both drivers take advantage of the feature, correct? >>>>>>>> Of course. >>>>>>>> Does the open stack have Vulkan support? >>>>>>>> >>>>>>>> Regards, >>>>>>>> David Zhou >>>>>>>>> >>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote: >>>>>>>>>> By the way, are you using the all-open driver or the amdgpu-pro driver? >>>>>>>>>> >>>>>>>>>> +David Mao, who is working on our Vulkan driver. >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> David Zhou >>>>>>>>>> >>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote: >>>>>>>>>>> Hi Serguei, >>>>>>>>>>> >>>>>>>>>>> I'm also working on bringing up our VR runtime on top of >>>>>>>>>>> amdgpu; >>>>>>>>>>> see replies inline. >>>>>>>>>>> >>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote: >>>>>>>>>>>> Andres, >>>>>>>>>>>> >>>>>>>>>>>>> For current VR workloads we have 3 separate processes >>>>>>>>>>>>> running >>>>>>>>>>>>> actually: >>>>>>>>>>>> So we could have a potential memory overcommit case, or do you do >>>>>>>>>>>> partitioning >>>>>>>>>>>> on your own? I would think that there is a need to avoid >>>>>>>>>>>> overcommit in >>>>>>>>>>>> the VR case to >>>>>>>>>>>> prevent any BO migration. >>>>>>>>>>> >>>>>>>>>>> You're entirely correct; currently the VR runtime is setting up >>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're >>>>>>>>>>> working on >>>>>>>>>>> prioritized GPU scheduling and pre-emption (e.g. this >>>>>>>>>>> thread), and in >>>>>>>>>>> the future it will make sense to do work in order to make >>>>>>>>>>> sure that >>>>>>>>>>> its memory allocations do not get evicted, to prevent any >>>>>>>>>>> unwelcome >>>>>>>>>>> additional latency in the event of needing to perform >>>>>>>>>>> just-in-time >>>>>>>>>>> reprojection. >>>>>>>>>>> >>>>>>>>>>>> BTW: Do you mean __real__ processes or threads? >>>>>>>>>>>> Based on my understanding, sharing BOs between different >>>>>>>>>>>> processes >>>>>>>>>>>> could introduce additional synchronization constraints. BTW: >>>>>>>>>>>> I am not >>>>>>>>>>>> sure >>>>>>>>>>>> if we are able to share Vulkan sync objects across the process >>>>>>>>>>>> boundary.
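As background on the cross-process synchronization question above: in released Vulkan this kind of sharing eventually took roughly the shape of VK_KHR_external_semaphore_fd. The sketch below is purely illustrative of that released mechanism (the extensions referenced in this thread were unreleased at the time) and assumes the file descriptor was passed between processes out of band, e.g. over a Unix socket:

    #include <vulkan/vulkan.h>

    /* Illustrative only: import a semaphore FD received from the client
     * process into the compositor's VkDevice.  Requires the
     * VK_KHR_external_semaphore_fd device extension to be enabled. */
    VkResult import_client_semaphore(VkDevice dev, int fd, VkSemaphore *sem)
    {
        VkSemaphoreCreateInfo sci = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        };
        VkResult r = vkCreateSemaphore(dev, &sci, NULL, sem);
        if (r != VK_SUCCESS)
            return r;

        VkImportSemaphoreFdInfoKHR info = {
            .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
            .semaphore = *sem,
            .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
            .fd = fd,                 /* ownership transfers on success */
        };

        PFN_vkImportSemaphoreFdKHR import_fd =
            (PFN_vkImportSemaphoreFdKHR)
                vkGetDeviceProcAddr(dev, "vkImportSemaphoreFdKHR");
        if (!import_fd)
            return VK_ERROR_EXTENSION_NOT_PRESENT;
        return import_fd(dev, &info);
    }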
>>>>>>>>>>> >>>>>>>>>>> They are different processes; it is important for the >>>>>>>>>>> compositor that >>>>>>>>>>> is responsible for quality-of-service features such as >>>>>>>>>>> consistently >>>>>>>>>>> presenting distorted frames with the right latency, >>>>>>>>>>> reprojection, etc., >>>>>>>>>>> to be separate from the main application. >>>>>>>>>>> >>>>>>>>>>> Currently we are using unreleased cross-process memory and >>>>>>>>>>> semaphore >>>>>>>>>>> extensions to fetch updated eye images from the client >>>>>>>>>>> application, >>>>>>>>>>> but the just-in-time reprojection discussed here does not >>>>>>>>>>> actually >>>>>>>>>>> have any direct interactions with cross-process resource >>>>>>>>>>> sharing, >>>>>>>>>>> since it's achieved by using whatever are the latest, most >>>>>>>>>>> up-to-date >>>>>>>>>>> eye images that have already been sent by the client >>>>>>>>>>> application, >>>>>>>>>>> which are already available to use without additional >>>>>>>>>>> synchronization. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> 3) System compositor (we are looking at approaches to >>>>>>>>>>>>> remove this >>>>>>>>>>>>> overhead) >>>>>>>>>>>> Yes, IMHO the best is to run in "full screen mode". >>>>>>>>>>> >>>>>>>>>>> Yes, we are working on mechanisms to present directly to the >>>>>>>>>>> headset >>>>>>>>>>> display without any intermediaries as a separate effort. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> The latency is our main concern, >>>>>>>>>>>> I would assume that this is a known problem (at least for >>>>>>>>>>>> compute >>>>>>>>>>>> usage). >>>>>>>>>>>> It looks like amdgpu / kernel submission is rather CPU >>>>>>>>>>>> intensive >>>>>>>>>>>> (at least >>>>>>>>>>>> in the default configuration). >>>>>>>>>>> >>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue. >>>>>>>>>>> However, if >>>>>>>>>>> there are high degrees of variance then that would be >>>>>>>>>>> troublesome and we >>>>>>>>>>> would need to account for the worst case. >>>>>>>>>>> >>>>>>>>>>> Hopefully the requirements and approach we described make >>>>>>>>>>> sense; we're >>>>>>>>>>> looking forward to your feedback and suggestions. >>>>>>>>>>> >>>>>>>>>>> Thanks! >>>>>>>>>>> - Pierre-Loup >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Sincerely yours, >>>>>>>>>>>> Serguei Sagalovitch >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com> >>>>>>>>>>>> Sent: December 16, 2016 10:00 PM >>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org >>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling >>>>>>>>>>>> in amdgpu >>>>>>>>>>>> >>>>>>>>>>>> Hey Serguei, >>>>>>>>>>>> >>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I >>>>>>>>>>>>> understand (by simplifying), >>>>>>>>>>>>> some scheduling is per pipe. I know about the current >>>>>>>>>>>>> allocation >>>>>>>>>>>>> scheme but I do not think >>>>>>>>>>>>> that it is ideal. I would assume that we need to switch to >>>>>>>>>>>>> dynamic partitioning >>>>>>>>>>>>> of resources based on the workload, otherwise we will have >>>>>>>>>>>>> resource >>>>>>>>>>>>> conflicts >>>>>>>>>>>>> between Vulkan compute and OpenCL. >>>>>>>>>>>> >>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can >>>>>>>>>>>> start with a >>>>>>>>>>>> solution that assumes that >>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no >>>>>>>>>>>> HSA/ROCm >>>>>>>>>>>> running on the system).
>>>>>>>>>>>> >>>>>>>>>>>> This should be more or less the use case we expect from VR >>>>>>>>>>>> users. >>>>>>>>>>>> >>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to >>>>>>>>>>>> consider >>>>>>>>>>>> that a separate task, because >>>>>>>>>>>> making it dynamic is not straightforward :P >>>>>>>>>>>> >>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so >>>>>>>>>>>>> amdkfd >>>>>>>>>>>>> will not be >>>>>>>>>>>>> involved. I would assume that in the case of VR we will >>>>>>>>>>>>> have one main >>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily >>>>>>>>>>>>> "ignore" >>>>>>>>>>>>> OpenCL/ROCm needs when VR is running. >>>>>>>>>>>> >>>>>>>>>>>> Correct, this is why we want to enable the high priority >>>>>>>>>>>> compute >>>>>>>>>>>> queue through >>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later. >>>>>>>>>>>> >>>>>>>>>>>> For current VR workloads we have 3 separate processes >>>>>>>>>>>> running actually: >>>>>>>>>>>> 1) Game process >>>>>>>>>>>> 2) VR Compositor (this is the process that will require the >>>>>>>>>>>> high >>>>>>>>>>>> priority queue) >>>>>>>>>>>> 3) System compositor (we are looking at approaches to >>>>>>>>>>>> remove this >>>>>>>>>>>> overhead) >>>>>>>>>>>> >>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm is running >>>>>>>>>>>> simultaneously, but >>>>>>>>>>>> I would also like to be able to address this case in the >>>>>>>>>>>> future >>>>>>>>>>>> (cross-pipe priorities). >>>>>>>>>>>> >>>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task: >>>>>>>>>>>>> (a) it >>>>>>>>>>>>> may take time, so >>>>>>>>>>>>> latency may suffer >>>>>>>>>>>> >>>>>>>>>>>> The latency is our main concern; we want something that is >>>>>>>>>>>> predictable. A good >>>>>>>>>>>> illustration of what the reprojection scheduling looks like >>>>>>>>>>>> can be >>>>>>>>>>>> found here: >>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> (b) to preempt we need to have a different "context" - we want >>>>>>>>>>>>> to guarantee that submissions from the same context will >>>>>>>>>>>>> be executed >>>>>>>>>>>>> in order. >>>>>>>>>>>> >>>>>>>>>>>> This is okay, as the reprojection work doesn't have >>>>>>>>>>>> dependencies on >>>>>>>>>>>> the game context, and it >>>>>>>>>>>> even happens in a separate process. >>>>>>>>>>>> >>>>>>>>>>>>> BTW: (a) Do you want to "preempt" and later resume, or do you >>>>>>>>>>>>> want to >>>>>>>>>>>>> "preempt" and >>>>>>>>>>>>> "cancel/abort"? >>>>>>>>>>>> >>>>>>>>>>>> Preempt the game with the compositor task and then resume it. >>>>>>>>>>>> >>>>>>>>>>>>> (b) Vulkan is a generic API and could be used for graphics >>>>>>>>>>>>> as well as >>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT). >>>>>>>>>>>> >>>>>>>>>>>> Yeah, the plan is to use Vulkan compute.
But if you figure >>>>>>>>>>>> out a way >>>>>>>>>>>> for us to get >>>>>>>>>>>> a guaranteed execution time using Vulkan graphics, then >>>>>>>>>>>> I'll take you >>>>>>>>>>>> out for a beer :) >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Andres >>>>>>>>>>>> ________________________________________ >>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com] >>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM >>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org >>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling >>>>>>>>>>>> in amdgpu >>>>>>>>>>>> >>>>>>>>>>>> Hi Andres, >>>>>>>>>>>> >>>>>>>>>>>> Please see inline (as [Serguei]) >>>>>>>>>>>> >>>>>>>>>>>> Sincerely yours, >>>>>>>>>>>> Serguei Sagalovitch >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com> >>>>>>>>>>>> Sent: December 16, 2016 8:29 PM >>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org >>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling >>>>>>>>>>>> in amdgpu >>>>>>>>>>>> >>>>>>>>>>>> Hi Serguei, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR]. >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Andres >>>>>>>>>>>> >>>>>>>>>>>> ________________________________________ >>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com] >>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM >>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org >>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling >>>>>>>>>>>> in amdgpu >>>>>>>>>>>> >>>>>>>>>>>> Andres, >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Quick comments: >>>>>>>>>>>> >>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU >>>>>>>>>>>> assignments/binding >>>>>>>>>>>> to the high-priority queue when it is in use and "free" >>>>>>>>>>>> them later >>>>>>>>>>>> (we do not want to take CUs away from e.g. the graphics task >>>>>>>>>>>> forever and degrade >>>>>>>>>>>> graphics >>>>>>>>>>>> performance). >>>>>>>>>>>> >>>>>>>>>>>> Otherwise we could have a scenario where a long graphics task (or >>>>>>>>>>>> low-priority >>>>>>>>>>>> compute) takes all (extra) CUs and the high-priority work >>>>>>>>>>>> waits for the >>>>>>>>>>>> needed resources. >>>>>>>>>>>> It will not be visible with "NOP" packets but only when you submit a >>>>>>>>>>>> "real" >>>>>>>>>>>> compute task, >>>>>>>>>>>> so I would recommend not using "NOP" packets at all for >>>>>>>>>>>> testing. >>>>>>>>>>>> >>>>>>>>>>>> It (CU assignment) could be done relatively easily when >>>>>>>>>>>> everything is >>>>>>>>>>>> going via the kernel >>>>>>>>>>>> (e.g. as part of frame submission), but I must admit that I >>>>>>>>>>>> am not sure >>>>>>>>>>>> about the best way for user level submissions (amdkfd). >>>>>>>>>>>> >>>>>>>>>>>> [AR] I wasn't aware of this part of the programming >>>>>>>>>>>> sequence. Thanks >>>>>>>>>>>> for the heads up! >>>>>>>>>>>> Is this similar to the CU masking programming? >>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler", >>>>>>>>>>>> when >>>>>>>>>>>> deciding which >>>>>>>>>>>> queue to run, will check if there are enough resources and, >>>>>>>>>>>> if not, >>>>>>>>>>>> it will begin >>>>>>>>>>>> to check other queues with lower priority. >>>>>>>>>>>> >>>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the >>>>>>>>>>>> high-priority >>>>>>>>>>>> queue and having >>>>>>>>>>>> nothing else on it. >>>>>>>>>>>> >>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>> (as opposed >>>>>>>>>>>> to the MEC definition >>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because >>>>>>>>>>>> amdgpu >>>>>>>>>>>> only has access to 1 pipe, >>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage. >>>>>>>>>>>> >>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I >>>>>>>>>>>> understand (by simplifying), >>>>>>>>>>>> some scheduling is per pipe. I know about the current >>>>>>>>>>>> allocation >>>>>>>>>>>> scheme but I do not think >>>>>>>>>>>> that it is ideal. I would assume that we need to switch to >>>>>>>>>>>> dynamic partitioning >>>>>>>>>>>> of resources based on the workload, otherwise we will have >>>>>>>>>>>> resource >>>>>>>>>>>> conflicts >>>>>>>>>>>> between Vulkan compute and OpenCL. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> BTW: Which user level API do you want to use for compute: >>>>>>>>>>>> Vulkan or >>>>>>>>>>>> OpenCL? >>>>>>>>>>>> >>>>>>>>>>>> [AR] Vulkan >>>>>>>>>>>> >>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so >>>>>>>>>>>> amdkfd will >>>>>>>>>>>> not be >>>>>>>>>>>> involved. I would assume that in the case of VR we will >>>>>>>>>>>> have one main >>>>>>>>>>>> application ("console" mode(?)) so we could temporarily >>>>>>>>>>>> "ignore" >>>>>>>>>>>> OpenCL/ROCm needs when VR is running. >>>>>>>>>>>> >>>>>>>>>>>>> we will not be able to provide a solution compatible with >>>>>>>>>>>>> GFX >>>>>>>>>>>>> workloads. >>>>>>>>>>>> I assume that you are talking about graphics? Am I right? >>>>>>>>>>>> >>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the >>>>>>>>>>>> currently running >>>>>>>>>>>> graphics job and scheduling in >>>>>>>>>>>> something else using mid-buffer pre-emption has some cases >>>>>>>>>>>> where it >>>>>>>>>>>> doesn't work well. But if it starts working well with >>>>>>>>>>>> polaris10, it might be a better >>>>>>>>>>>> solution for >>>>>>>>>>>> us (because the whole reprojection >>>>>>>>>>>> work uses the Vulkan graphics stack at the moment, and >>>>>>>>>>>> porting it to >>>>>>>>>>>> compute is not trivial). >>>>>>>>>>>> >>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task: >>>>>>>>>>>> (a) it may >>>>>>>>>>>> take time, so >>>>>>>>>>>> latency may suffer; (b) to preempt we need to have a different >>>>>>>>>>>> "context" >>>>>>>>>>>> - we want >>>>>>>>>>>> to guarantee that submissions from the same context will be >>>>>>>>>>>> executed >>>>>>>>>>>> in order. >>>>>>>>>>>> BTW: (a) Do you want to "preempt" and later resume, or do you >>>>>>>>>>>> want to >>>>>>>>>>>> "preempt" and >>>>>>>>>>>> "cancel/abort"? (b) Vulkan is a generic API and could be used >>>>>>>>>>>> for graphics as well as for plain compute tasks >>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
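For a sense of how the Vulkan-compute plan could eventually look from userspace, here is an illustrative sketch using the VK_EXT_global_priority extension that was released after this discussion took place; it is only one possible shape for the interface, not the mechanism being designed in this thread, and compute_family is assumed to be a VK_QUEUE_COMPUTE_BIT-capable queue family found by the caller:

    #include <vulkan/vulkan.h>

    /* Chained into VkDeviceQueueCreateInfo to ask for an elevated-priority
     * queue; requires VK_EXT_global_priority to be enabled on the device. */
    static const VkDeviceQueueGlobalPriorityCreateInfoEXT high_prio = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_GLOBAL_PRIORITY_CREATE_INFO_EXT,
        .globalPriority = VK_QUEUE_GLOBAL_PRIORITY_HIGH_EXT,
    };

    static const float queue_weight = 1.0f;

    /* Build the queue request the compositor would pass to vkCreateDevice(). */
    VkDeviceQueueCreateInfo high_priority_compute_queue(uint32_t compute_family)
    {
        VkDeviceQueueCreateInfo qi = {
            .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
            .pNext = &high_prio,               /* global-priority request */
            .queueFamilyIndex = compute_family,
            .queueCount = 1,
            .pQueuePriorities = &queue_weight,
        };
        return qi;
    }

If the caller lacks the privileges required by whatever permission model the kernel ends up with, device creation can fail, so a compositor would typically fall back to a normal-priority queue.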
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Sincerely yours, >>>>>>>>>>>> Serguei Sagalovitch >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on >>>>>>>>>>>> behalf of >>>>>>>>>>>> Andres Rodriguez <andresr at valvesoftware.com> >>>>>>>>>>>> Sent: December 16, 2016 6:15 PM >>>>>>>>>>>> To: amd-gfx at lists.freedesktop.org >>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in >>>>>>>>>>>> amdgpu >>>>>>>>>>>> >>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>> >>>>>>>>>>>> This RFC is also available as a gist here: >>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249 >>>>>>>>>>>> >>>>>>>>>>>> We are interested in feedback for a mechanism to >>>>>>>>>>>> effectively schedule >>>>>>>>>>>> high >>>>>>>>>>>> priority VR reprojection tasks (also referred to as >>>>>>>>>>>> time-warping) for >>>>>>>>>>>> Polaris10 >>>>>>>>>>>> running on the amdgpu kernel driver. >>>>>>>>>>>> >>>>>>>>>>>> Brief context: >>>>>>>>>>>> -------------- >>>>>>>>>>>> >>>>>>>>>>>> The main objective of reprojection is to avoid motion >>>>>>>>>>>> sickness for VR >>>>>>>>>>>> users in >>>>>>>>>>>> scenarios where the game or application would fail to finish >>>>>>>>>>>> rendering a new >>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the >>>>>>>>>>>> user's head >>>>>>>>>>>> movements >>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the >>>>>>>>>>>> duration >>>>>>>>>>>> of an >>>>>>>>>>>> extra frame. This extended mismatch between the inner ear >>>>>>>>>>>> and the >>>>>>>>>>>> eyes may >>>>>>>>>>>> cause the user to experience motion sickness. >>>>>>>>>>>> >>>>>>>>>>>> The VR compositor deals with this problem by fabricating a >>>>>>>>>>>> new frame >>>>>>>>>>>> using the >>>>>>>>>>>> user's updated head position in combination with the >>>>>>>>>>>> previous frames. >>>>>>>>>>>> This >>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the >>>>>>>>>>>> inner ear. >>>>>>>>>>>> >>>>>>>>>>>> Because of the adverse effects on the user, we require high >>>>>>>>>>>> confidence that the >>>>>>>>>>>> reprojection task will complete before the VBLANK interval, >>>>>>>>>>>> even if >>>>>>>>>>>> the GFX pipe >>>>>>>>>>>> is currently full of work from the game/application (which >>>>>>>>>>>> is most >>>>>>>>>>>> likely the case).
>>>>>>>>>>>> >>>>>>>>>>>> For more details and illustrations, please refer to the >>>>>>>>>>>> following >>>>>>>>>>>> document: >>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved >>>>>>>>>>>> >>>>>>>>>>>> Requirements: >>>>>>>>>>>> ------------- >>>>>>>>>>>> >>>>>>>>>>>> The mechanism must expose the following functionality: >>>>>>>>>>>> >>>>>>>>>>>> * Job round trip time must be predictable, from >>>>>>>>>>>> submission to >>>>>>>>>>>> fence signal >>>>>>>>>>>> >>>>>>>>>>>> * The mechanism must support compute workloads. >>>>>>>>>>>> >>>>>>>>>>>> Goals: >>>>>>>>>>>> ------ >>>>>>>>>>>> >>>>>>>>>>>> * The mechanism should provide low submission latencies >>>>>>>>>>>> >>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy >>>>>>>>>>>> hardware >>>>>>>>>>>> should >>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware. >>>>>>>>>>>> >>>>>>>>>>>> Nice to have: >>>>>>>>>>>> ------------- >>>>>>>>>>>> >>>>>>>>>>>> * The mechanism should also support GFX workloads. >>>>>>>>>>>> >>>>>>>>>>>> My understanding is that with the current hardware >>>>>>>>>>>> capabilities in >>>>>>>>>>>> Polaris10 we >>>>>>>>>>>> will not be able to provide a solution compatible with GFX >>>>>>>>>>>> workloads. >>>>>>>>>>>> >>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea, >>>>>>>>>>>> approach or >>>>>>>>>>>> suggestion that will also be compatible with the GFX ring, >>>>>>>>>>>> please let >>>>>>>>>>>> us know >>>>>>>>>>>> about it. >>>>>>>>>>>> >>>>>>>>>>>> * The above guarantees should also be respected by >>>>>>>>>>>> amdkfd workloads >>>>>>>>>>>> >>>>>>>>>>>> Would be good to have for consistency, but not strictly >>>>>>>>>>>> necessary as >>>>>>>>>>>> users running >>>>>>>>>>>> games are not traditionally running HPC workloads in the >>>>>>>>>>>> background. >>>>>>>>>>>> >>>>>>>>>>>> Proposed approach: >>>>>>>>>>>> ------------------ >>>>>>>>>>>> >>>>>>>>>>>> Similar to the Windows driver, we could expose a high priority >>>>>>>>>>>> compute queue to >>>>>>>>>>>> userspace. >>>>>>>>>>>> >>>>>>>>>>>> Submissions to this compute queue will be scheduled with high >>>>>>>>>>>> priority, and may >>>>>>>>>>>> acquire hardware resources previously in use by other queues.
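As a starting point for the "NOP on busy vs. idle hardware" comparison listed under Goals above, here is a rough, self-contained timing skeleton. submit_and_wait is a placeholder for whatever submission path is being measured (for example a libdrm-amdgpu command submission followed by a fence wait); it is not a real API, and the harness only illustrates the measurement, not the mechanism itself:

    #include <stdio.h>
    #include <time.h>

    /* Placeholder for the path under test: submit one NOP-sized job and
     * block until its fence signals; returns 0 on success. */
    typedef int (*submit_fn)(void *ctx);

    /* Measure one submission-to-fence round trip in microseconds. */
    static double round_trip_us(submit_fn submit_and_wait, void *ctx)
    {
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (submit_and_wait(ctx))
                    return -1.0;                     /* submission failed */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            return (t1.tv_sec - t0.tv_sec) * 1e6 +
                   (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }

The idea would be to sample round_trip_us() many times on an idle GPU and again while a heavy graphics workload runs, then compare the distributions (median and worst case rather than just the mean); this also lines up with Serguei's caution earlier in the thread that NOP-only testing can hide CU contention.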
>>>>>>>>>>>> >>>>>>>>>>>> This can be achieved by taking advantage of the 'priority' >>>>>>>>>>>> field in >>>>>>>>>>>> the HQDs >>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. >>>>>>>>>>>> The relevant >>>>>>>>>>>> register fields are: >>>>>>>>>>>> * mmCP_HQD_PIPE_PRIORITY >>>>>>>>>>>> * mmCP_HQD_QUEUE_PRIORITY >>>>>>>>>>>> >>>>>>>>>>>> Implementation approach 1 - static partitioning: >>>>>>>>>>>> ------------------------------------------------ >>>>>>>>>>>> >>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from >>>>>>>>>>>> pipe0. We can >>>>>>>>>>>> statically partition these as follows: >>>>>>>>>>>> * 7x regular >>>>>>>>>>>> * 1x high priority >>>>>>>>>>>> >>>>>>>>>>>> The relevant priorities can be set so that submissions to >>>>>>>>>>>> the high >>>>>>>>>>>> priority >>>>>>>>>>>> ring will starve the other compute rings and the GFX ring. >>>>>>>>>>>> >>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high >>>>>>>>>>>> priority >>>>>>>>>>>> rings if the >>>>>>>>>>>> context is marked as high priority. And a corresponding >>>>>>>>>>>> priority >>>>>>>>>>>> should be >>>>>>>>>>>> added to keep track of this information: >>>>>>>>>>>> * AMD_SCHED_PRIORITY_KERNEL >>>>>>>>>>>> * -> AMD_SCHED_PRIORITY_HIGH >>>>>>>>>>>> * AMD_SCHED_PRIORITY_NORMAL >>>>>>>>>>>> >>>>>>>>>>>> The user will request a high priority context by setting an >>>>>>>>>>>> appropriate flag >>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar): >>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 >>>>>>>>>>>> >>>>>>>>>>>> The setting is at a per-context level so that we can: >>>>>>>>>>>> * Maintain a consistent FIFO ordering of all >>>>>>>>>>>> submissions to a >>>>>>>>>>>> context >>>>>>>>>>>> * Create high priority and non-high priority contexts >>>>>>>>>>>> in the same >>>>>>>>>>>> process >>>>>>>>>>>> >>>>>>>>>>>> Implementation approach 2 - dynamic priority programming: >>>>>>>>>>>> --------------------------------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> Similar to the above, but instead of programming the >>>>>>>>>>>> priorities at >>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the >>>>>>>>>>>> queue priorities >>>>>>>>>>>> dynamically when scheduling a task. >>>>>>>>>>>> >>>>>>>>>>>> This would involve having a hardware-specific callback from >>>>>>>>>>>> the >>>>>>>>>>>> scheduler to >>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring, >>>>>>>>>>>> int index, >>>>>>>>>>>> int priority) >>>>>>>>>>>> >>>>>>>>>>>> During this callback we would have to grab the SRBM mutex >>>>>>>>>>>> to perform >>>>>>>>>>>> the appropriate >>>>>>>>>>>> HW programming, and I'm not really sure if that is >>>>>>>>>>>> something we >>>>>>>>>>>> should be doing from >>>>>>>>>>>> the scheduler. >>>>>>>>>>>> >>>>>>>>>>>> On the positive side, this approach would allow us to >>>>>>>>>>>> program a range of >>>>>>>>>>>> priorities for jobs instead of a single "high priority" >>>>>>>>>>>> value, >>>>>>>>>>>> achieving >>>>>>>>>>>> something similar to the niceness API available for CPU >>>>>>>>>>>> scheduling. >>>>>>>>>>>> >>>>>>>>>>>> I'm not sure if this flexibility is something that we would >>>>>>>>>>>> need for >>>>>>>>>>>> our use >>>>>>>>>>>> case, but it might be useful in other scenarios (multiple >>>>>>>>>>>> users >>>>>>>>>>>> sharing compute >>>>>>>>>>>> time on a server).
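To make the two proposals above easier to compare, here is a small, purely illustrative sketch of the pieces they would add; the enum values follow the list given under approach 1, and set_priority() matches the callback signature quoted under approach 2. None of the names are final, and the body of the callback (register writes to mmCP_HQD_PIPE_PRIORITY / mmCP_HQD_QUEUE_PRIORITY under the SRBM mutex) is deliberately left out:

    /* Proposed scheduler priority levels (kernel highest, as listed above). */
    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_NORMAL = 0,
            AMD_SCHED_PRIORITY_HIGH,        /* proposed addition */
            AMD_SCHED_PRIORITY_KERNEL,
    };

    /* Approach 2: hardware-specific hook the SW scheduler would call to
     * reprogram an HQD's priority before running a job on that queue. */
    typedef void (*set_priority_fn)(int ring, int index, int priority);

    /* Approach 1 needs no such callback: the high-priority HQD is
     * programmed once at init time, and the scheduler merely routes jobs
     * from high-priority contexts to that dedicated ring. */
    static int pick_ring(enum amd_sched_priority ctx_prio,
                         int normal_ring, int high_prio_ring)
    {
            return (ctx_prio >= AMD_SCHED_PRIORITY_HIGH) ? high_prio_ring
                                                         : normal_ring;
    }

The main trade-off is visible even at this level: approach 1 keeps all privileged register programming out of the scheduler's hot path, while approach 2 buys a full range of priorities at the cost of touching the SRBM mutex from the scheduler.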
>>>>>>>>>>>> >>>>>>>>>>>> This approach would require a new int field in >>>>>>>>>>>> drm_amdgpu_ctx_in, or >>>>>>>>>>>> repurposing >>>>>>>>>>>> of the flags field. >>>>>>>>>>>> >>>>>>>>>>>> Known current obstacles: >>>>>>>>>>>> ------------------------ >>>>>>>>>>>> >>>>>>>>>>>> The SQ is currently programmed to disregard the HQD >>>>>>>>>>>> priorities, and >>>>>>>>>>>> instead it picks >>>>>>>>>>>> jobs at random. Settings from the shader itself are also >>>>>>>>>>>> disregarded >>>>>>>>>>>> as this is >>>>>>>>>>>> considered a privileged field. >>>>>>>>>>>> >>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP, >>>>>>>>>>>> but we >>>>>>>>>>>> might not get the >>>>>>>>>>>> time we need on the SQ. >>>>>>>>>>>> >>>>>>>>>>>> The current programming would have to be changed to allow >>>>>>>>>>>> priority >>>>>>>>>>>> propagation >>>>>>>>>>>> from the HQD into the SQ. >>>>>>>>>>>> >>>>>>>>>>>> Generic approach for all HW IPs: >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> For consistency purposes, the high priority context can be >>>>>>>>>>>> enabled >>>>>>>>>>>> for all HW IPs >>>>>>>>>>>> with support for the SW scheduler. This will function >>>>>>>>>>>> similarly to the >>>>>>>>>>>> current >>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump >>>>>>>>>>>> ahead of >>>>>>>>>>>> anything not >>>>>>>>>>>> committed to the HW queue. >>>>>>>>>>>> >>>>>>>>>>>> The benefits of requesting a high priority context for a >>>>>>>>>>>> non-compute >>>>>>>>>>>> queue will >>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is >>>>>>>>>>>> stuck in >>>>>>>>>>>> front of >>>>>>>>>>>> you), but having the API in place will allow us to easily >>>>>>>>>>>> improve the >>>>>>>>>>>> implementation >>>>>>>>>>>> in the future as new features become available in new >>>>>>>>>>>> hardware. >>>>>>>>>>>> >>>>>>>>>>>> Future steps: >>>>>>>>>>>> ------------- >>>>>>>>>>>> >>>>>>>>>>>> Once we have an approach settled, I can take care of the >>>>>>>>>>>> implementation. >>>>>>>>>>>> >>>>>>>>>>>> Also, once the interface is mostly decided, we can start >>>>>>>>>>>> thinking about >>>>>>>>>>>> exposing the high priority queue through radv. >>>>>>>>>>>> >>>>>>>>>>>> Request for feedback: >>>>>>>>>>>> --------------------- >>>>>>>>>>>> >>>>>>>>>>>> We aren't married to any of the approaches outlined above. >>>>>>>>>>>> Our goal >>>>>>>>>>>> is to >>>>>>>>>>>> obtain a mechanism that will allow us to complete the >>>>>>>>>>>> reprojection >>>>>>>>>>>> job within a >>>>>>>>>>>> predictable amount of time. So if anyone has any >>>>>>>>>>>> suggestions for >>>>>>>>>>>> improvements or alternative strategies we are more than >>>>>>>>>>>> happy to hear >>>>>>>>>>>> them. >>>>>>>>>>>> >>>>>>>>>>>> If any of the technical information above is >>>>>>>>>>>> incorrect, feel >>>>>>>>>>>> free to point >>>>>>>>>>>> out my misunderstandings. >>>>>>>>>>>> >>>>>>>>>>>> Looking forward to hearing from you. >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Andres >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> amd-gfx mailing list >>>>>>>>>>>> amd-gfx at lists.freedesktop.org >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>> _______________________________________________ >>>>>>>>>>> amd-gfx mailing list >>>>>>>>>>> amd-gfx at lists.freedesktop.org >>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> amd-gfx mailing list >>>>>>>> amd-gfx at lists.freedesktop.org >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>> >>>>> Sincerely yours, >>>>> Serguei Sagalovitch >>>>> >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> amd-gfx at lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>> >>> >> > Sincerely yours, Serguei Sagalovitch