I'm actually testing that out today. Example prototype patch:
https://github.com/lostgoat/linux/commit/c9d88d409d8655d63aa8386edc66687a2ba64a12

My goal is to first implement this approach, then slowly work my way
towards the HW-level optimizations.

The problem I expect to see with this approach is that there will still be
unpredictably long latencies depending on what has been committed to the
HW rings. But it is definitely a good start.
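For reference, a minimal sketch of what the scheduler-level approach looks
like (the names and structure below are illustrative and simplified, not
the actual patch): the GPU scheduler keeps one run queue per priority
level and always drains higher-priority queues first, so a high priority
context can jump ahead of anything still waiting in the software queues:

    /* Illustrative sketch of priority-aware entity selection in the
     * amdgpu GPU scheduler; simplified, see the prototype commit above
     * for the real thing. */
    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_NORMAL = 0,
            AMD_SCHED_PRIORITY_HIGH,
            AMD_SCHED_PRIORITY_KERNEL,
            AMD_SCHED_PRIORITY_MAX
    };

    struct amd_gpu_scheduler {
            /* one run queue per priority level */
            struct amd_sched_rq sched_rq[AMD_SCHED_PRIORITY_MAX];
            /* ... */
    };

    static struct amd_sched_entity *
    amd_sched_select_entity(struct amd_gpu_scheduler *sched)
    {
            struct amd_sched_entity *entity;
            int i;

            /* scan the run queues from highest priority to lowest */
            for (i = AMD_SCHED_PRIORITY_MAX - 1; i >= 0; i--) {
                    entity = amd_sched_rq_select_entity(&sched->sched_rq[i]);
                    if (entity)
                            return entity;
            }
            return NULL;
    }

Note that this only reorders work that hasn't left the software scheduler
yet, which is exactly why the latencies mentioned above stay unpredictable
once jobs have been committed to the HW rings.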
Regards,
Andres

On Fri, Dec 23, 2016 at 11:20 AM, Bridgman, John <John.Bridgman at amd.com> wrote:
> One question I just remembered - the amdgpu driver includes some scheduler
> logic which maintains per-process queues and therefore avoids loading up
> the primary ring with a ton of work.
>
> Has there been any experimentation with injecting priorities at that level
> rather than jumping straight to HW-level changes?
> ------------------------------
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> Andres Rodriguez <andresx7 at gmail.com>
> *Sent:* December 23, 2016 11:13 AM
> *To:* Koenig, Christian
> *Cc:* Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch,
> Serguei; amd-gfx at lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A.
> Griffais; Zhang, Hawking
>
> *Subject:* Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hey Christian,
>
>>> But yes, in general you don't want another compositor in the way, so
>>> we'll be acquiring the HMD display directly, separate from any desktop
>>> or display server.
>>
>> Assuming that the HMD is attached to the rendering device in some way
>> you have the X server and the Compositor which both try to be DRM master
>> at the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds like
>> it will simply not work. Or is this what Andres mentioned below that Dave
>> is working on?
>
> You are correct on both statements. We can't have two DRM_MASTERs, so the
> current DRM+X does not support this use case. And this is what Dave and
> Pierre-Loup are currently working on.
>
>> In addition, a compositor in combination with X is a bit counterproductive
>> when you want to keep the latency low.
>
> One thing I'd like to correct is that our main goal is to get latency
> _predictable_; the secondary goal is to make it low.
>
> The high priority queue feature addresses our main source of
> unpredictability: the scheduling latency when the hardware is already full
> of work from the game engine.
>
> The DirectMode feature addresses one of the latency sources: multiple
> (unnecessary) context switches to submit a surface to the DRM driver.
>
>> Targeting something like Wayland, with XWayland when you need X
>> compatibility, sounds like the much better idea.
>
> We are pretty enthusiastic about Wayland (and really glad to see Fedora 25
> use Wayland by default). Once we have everything working nicely under X
> (where most of the users are currently), I'm sure Pierre-Loup will be
> pushing us to get everything optimized under Wayland as well (which should
> be a lot simpler!).
>
> Ever since working with SurfaceFlinger on Android with explicit fencing
> I've been waiting for the day I can finally ditch X altogether :)
>
> Regards,
> Andres
>
> On Fri, Dec 23, 2016 at 5:54 AM, Christian König <christian.koenig at amd.com> wrote:
>
>>> But yes, in general you don't want another compositor in the way, so
>>> we'll be acquiring the HMD display directly, separate from any desktop
>>> or display server.
>>
>> Assuming that the HMD is attached to the rendering device in some way
>> you have the X server and the Compositor which both try to be DRM master
>> at the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds like
>> it will simply not work. Or is this what Andres mentioned below that Dave
>> is working on?
>>
>> In addition, a compositor in combination with X is a bit counterproductive
>> when you want to keep the latency low.
>>
>> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
>> data to be displayed is from the Application -> X server -> compositor ->
>> X server.
>>
>> The extra step between X server and compositor just means extra latency
>> and for this use case you probably don't want that.
>>
>> Targeting something like Wayland, with XWayland when you need X
>> compatibility, sounds like the much better idea.
>>
>> Regards,
>> Christian.
>>
>> On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>>
>>> Display concerns are a separate issue, and as Andres said we have other
>>> plans to address. But yes, in general you don't want another compositor
>>> in the way, so we'll be acquiring the HMD display directly, separate
>>> from any desktop or display server. Same with security, we can have a
>>> separate conversation about that when the time comes.
>>>
>>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>>
>>>> Andres,
>>>>
>>>> Did you measure the latency, etc. impact of __any__ compositor?
>>>>
>>>> My understanding is that VR has pretty strict requirements related to
>>>> QoS.
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>>
>>>>> Hey Christian,
>>>>>
>>>>> We are currently interested in X, but with some distros switching to
>>>>> other compositors by default, we also need to consider those.
>>>>>
>>>>> We agree, running the full vrcompositor as root isn't something that
>>>>> we want to do. Too many security concerns. Having a small root helper
>>>>> that does the privilege escalation for us is the initial idea.
>>>>>
>>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>>> with the "two compositors" scenario a little better in DRM+X.
>>>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>>>> HMD to be used as part of the Desktop environment when a VR app is not
>>>>> in use (this is extremely annoying).
>>>>>
>>>>> When the above is settled, we should have an auth mechanism besides
>>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>>> IOCTL is probably going to be the final solution.
>>>>>
>>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>>> from a restrictive to a more flexible permission model would be
>>>>> inclusive, but going from a general to a restrictive model may exclude
>>>>> some apps that used to work.
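>>>>> As a minimal illustration of the ROOT_ONLY gate (the flag name is a
>>>>> placeholder, per the interface discussion further down the thread):
>>>>>
>>>>>     /* Illustrative only: reject the (proposed) high priority flag
>>>>>      * for any caller that isn't root / CAP_SYS_ADMIN. */
>>>>>     if ((args->in.flags & AMDGPU_CTX_HIGH_PRIORITY) &&
>>>>>         !capable(CAP_SYS_ADMIN))
>>>>>             return -EPERM;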
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>>
>>>>>> Hi Andres,
>>>>>>
>>>>>> well using root might cause stability and security problems as well.
>>>>>> We worked quite hard to avoid exactly this for X.
>>>>>>
>>>>>> We could make this feature depend on the compositor being DRM master,
>>>>>> but for example with X the X server is master (and e.g. can change
>>>>>> resolutions etc.) and not the compositor.
>>>>>>
>>>>>> So another question is also what windowing system (if any) are you
>>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>>> different?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>>> make the high priority queues accessible to root only.
>>>>>>>
>>>>>>> Therefore, if a non-root user attempts to set the high priority flag
>>>>>>> on context allocation, we would fail the call and return EPERM.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>>
>>>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>
>>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>>
>>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>>>>> use it.
>>>>>>>>
>>>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>>>> and we won't get anything drawn any more.
>>>>>>>>
>>>>>>>> Alex or Michel, any ideas on that?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>>>>>>>>
>>>>>>>>> > If the compute queue is occupied only by you, the efficiency
>>>>>>>>> > is equal to setting the job queue to high priority, I think.
>>>>>>>>> The only risk is the situation when graphics will take all
>>>>>>>>> needed CUs. But in any case it should be a very good test.
>>>>>>>>>
>>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>>
>>>>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>>>>
>>>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>>
>>>>>>>>>> Do you encounter the priority issue for the compute queue with
>>>>>>>>>> the current driver?
>>>>>>>>>>
>>>>>>>>>> If the compute queue is occupied only by you, the efficiency is
>>>>>>>>>> equal to setting the job queue to high priority, I think.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, Vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>>> Vulkan level that would be great.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>>
>>>>>>>>>>> - Andres
>>>>>>>>>>>
>>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>>
>>>>>>>>>>>> Of course.
>>>>>>>>>>>> Does the open stack have Vulkan support?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> By the way, are you using the all-open driver or the
>>>>>>>>>>>>>> amdgpu-pro driver?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>>>> amdgpu; see replies inline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do
>>>>>>>>>>>>>>>> you do partitioning on your own? I would think that there
>>>>>>>>>>>>>>>> is a need to avoid overcommit in the VR case to prevent any
>>>>>>>>>>>>>>>> BO migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting
>>>>>>>>>>>>>>> up prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>>> working on prioritized GPU scheduling and pre-emption (e.g.
>>>>>>>>>>>>>>> this thread), and in the future it will make sense to do
>>>>>>>>>>>>>>> work in order to make sure that its memory allocations do
>>>>>>>>>>>>>>> not get evicted, to prevent any unwelcome additional latency
>>>>>>>>>>>>>>> in the event of needing to perform just-in-time
>>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>>> Based on my understanding sharing BOs between different
>>>>>>>>>>>>>>>> processes could introduce additional synchronization
>>>>>>>>>>>>>>>> constraints. BTW: I am not sure if we are able to share
>>>>>>>>>>>>>>>> Vulkan sync objects across the process boundary.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>>> compositor that is responsible for quality-of-service
>>>>>>>>>>>>>>> features such as consistently presenting distorted frames
>>>>>>>>>>>>>>> with the right latency, reprojection, etc., to be separate
>>>>>>>>>>>>>>> from the main application.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>>> semaphore extensions to fetch updated eye images from the
>>>>>>>>>>>>>>> client application, but the just-in-time reprojection
>>>>>>>>>>>>>>> discussed here does not actually have any direct
>>>>>>>>>>>>>>> interactions with cross-process resource sharing, since it's
>>>>>>>>>>>>>>> achieved by using whatever is the latest, most up-to-date
>>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>>> application, which are already available to use without
>>>>>>>>>>>>>>> additional synchronization.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>>> remove this overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, IMHO the best is to run in "full screen mode".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>>>> headset display without any intermediaries as a separate
>>>>>>>>>>>>>>> effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The latency is our main concern,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>>>>>>>> compute usage). It looks like amdgpu / kernel submission is
>>>>>>>>>>>>>>>> rather CPU intensive (at least in the default
>>>>>>>>>>>>>>>> configuration).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>>>> However, if there are high degrees of variance then that
>>>>>>>>>>>>>>> would be troublesome and we would need to account for the
>>>>>>>>>>>>>>> worst case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>>> sense, we're looking forward to your feedback and
>>>>>>>>>>>>>>> suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> - Pierre-Loup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As
>>>>>>>>>>>>>>>>> far as I understand (by simplifying) some scheduling is
>>>>>>>>>>>>>>>>> per pipe. I know about the current allocation scheme but
>>>>>>>>>>>>>>>>> I do not think that it is ideal. I would assume that we
>>>>>>>>>>>>>>>>> need to switch to dynamic partitioning of resources based
>>>>>>>>>>>>>>>>> on the workload, otherwise we will have a resource
>>>>>>>>>>>>>>>>> conflict between Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>>> start with a solution that assumes that only pipe0 has any
>>>>>>>>>>>>>>>> work and the other pipes are idle (no HSA/ROCm running on
>>>>>>>>>>>>>>>> the system).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>>> consider that a separate task, because making it dynamic is
>>>>>>>>>>>>>>>> not straightforward :P
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>>> amdkfd will not be involved.
>>>>>>>>>>>>>>>>> I would assume that in the case of VR we will have one
>>>>>>>>>>>>>>>>> main application ("console" mode(?)) so we could
>>>>>>>>>>>>>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>>> compute queue through libdrm-amdgpu, so that we can expose
>>>>>>>>>>>>>>>> it through Vulkan later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>> 1) Game process
>>>>>>>>>>>>>>>> 2) VR Compositor (this is the process that will require the
>>>>>>>>>>>>>>>> high priority queue)
>>>>>>>>>>>>>>>> 3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>>>> simultaneously, but I would also like to be able to address
>>>>>>>>>>>>>>>> this case in the future (cross-pipe priorities).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>>>>>>>>>>> (a) it may take time so latency may suffer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>>>> predictable. A good illustration of what the reprojection
>>>>>>>>>>>>>>>> scheduling looks like can be found here:
>>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (b) to preempt we need to have a different "context" - we
>>>>>>>>>>>>>>>>> want to guarantee that submissions from the same context
>>>>>>>>>>>>>>>>> will be executed in order.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>>> dependencies on the game context, and it even happens in a
>>>>>>>>>>>>>>>> separate process.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>>> want "preempt" and "cancel/abort"?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume
>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (b) Vulkan is a generic API and could be used for graphics
>>>>>>>>>>>>>>>>> as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, the plan is to use Vulkan compute.
>>>>>>>>>>>>>>>> But if you figure out a way for us to get a guaranteed
>>>>>>>>>>>>>>>> execution time using Vulkan graphics, then I'll take you
>>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>>> assignments/binding to the high-priority queue when it is
>>>>>>>>>>>>>>>> in use and "free" them later (we do not want to take CUs
>>>>>>>>>>>>>>>> away from e.g. a graphics task forever and degrade graphics
>>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics
>>>>>>>>>>>>>>>> task (or low-priority compute) takes all (extra) CUs and
>>>>>>>>>>>>>>>> high-priority work will wait for the needed resources.
>>>>>>>>>>>>>>>> It will not be visible with "NOP" packets but only when you
>>>>>>>>>>>>>>>> submit a "real" compute task, so I would recommend not to
>>>>>>>>>>>>>>>> use "NOP" packets at all for testing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It (CU assignment) could be relatively easily done when
>>>>>>>>>>>>>>>> everything goes via the kernel (e.g. as part of frame
>>>>>>>>>>>>>>>> submission) but I must admit that I am not sure about the
>>>>>>>>>>>>>>>> best way for user-level submissions (amdkfd).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>>> sequence. Thanks for the heads up!
>>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>>> [Serguei] Yes.
>>>>>>>>>>>>>>>> To simplify: the problem is that the "scheduler", when
>>>>>>>>>>>>>>>> deciding which queue to run, will check if there are enough
>>>>>>>>>>>>>>>> resources, and if not then it will begin to check other
>>>>>>>>>>>>>>>> queues with lower priority.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to the
>>>>>>>>>>>>>>>> high-priority queue and have nothing there except it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>>> (as opposed to the MEC definition of pipe, which is a
>>>>>>>>>>>>>>>> grouping of queues). I say this because amdgpu only has
>>>>>>>>>>>>>>>> access to 1 pipe, and the rest are statically partitioned
>>>>>>>>>>>>>>>> for amdkfd usage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far
>>>>>>>>>>>>>>>> as I understand (by simplifying) some scheduling is per
>>>>>>>>>>>>>>>> pipe. I know about the current allocation scheme but I do
>>>>>>>>>>>>>>>> not think that it is ideal. I would assume that we need to
>>>>>>>>>>>>>>>> switch to dynamic partitioning of resources based on the
>>>>>>>>>>>>>>>> workload, otherwise we will have a resource conflict
>>>>>>>>>>>>>>>> between Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>>> Vulkan or OpenCL?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the
>>>>>>>>>>>>>>>> case of VR we will have one main application ("console"
>>>>>>>>>>>>>>>> mode(?)) so we could temporarily "ignore" OpenCL/ROCm needs
>>>>>>>>>>>>>>>> when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>>> GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>>> currently running graphics job and scheduling in something
>>>>>>>>>>>>>>>> else using mid-buffer pre-emption has some cases where it
>>>>>>>>>>>>>>>> doesn't work well. But if it starts working well with
>>>>>>>>>>>>>>>> Polaris10, it might be a better solution for us (because
>>>>>>>>>>>>>>>> the whole reprojection work uses the Vulkan graphics stack
>>>>>>>>>>>>>>>> at the moment, and porting it to compute is not trivial).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>>>>>>>>>> (a) it may take time so latency may suffer (b) to preempt
>>>>>>>>>>>>>>>> we need to have a different "context" - we want to
>>>>>>>>>>>>>>>> guarantee that submissions from the same context will be
>>>>>>>>>>>>>>>> executed in order.
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want "preempt" and "cancel/abort"?
>>>>>>>>>>>>>>>> (b) Vulkan is a generic API and could be used for graphics
>>>>>>>>>>>>>>>> as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>>>>>>> behalf of Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>>>> effectively schedule high priority VR reprojection tasks
>>>>>>>>>>>>>>>> (also referred to as time-warping) for Polaris10 running on
>>>>>>>>>>>>>>>> the amdgpu kernel driver.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>>> sickness for VR users in scenarios where the game or
>>>>>>>>>>>>>>>> application would fail to finish rendering a new frame in
>>>>>>>>>>>>>>>> time for the next VBLANK. When this happens, the user's
>>>>>>>>>>>>>>>> head movements are not reflected on the Head Mounted
>>>>>>>>>>>>>>>> Display (HMD) for the duration of an extra frame. This
>>>>>>>>>>>>>>>> extended mismatch between the inner ear and the eyes may
>>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>>> new frame using the user's updated head position in
>>>>>>>>>>>>>>>> combination with the previous frames. This avoids a
>>>>>>>>>>>>>>>> prolonged mismatch between the HMD output and the inner
>>>>>>>>>>>>>>>> ear.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>>>> confidence that the reprojection task will complete before
>>>>>>>>>>>>>>>> the VBLANK interval, even if the GFX pipe is currently full
>>>>>>>>>>>>>>>> of work from the game/application (which is most likely the
>>>>>>>>>>>>>>>> case).
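>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (To put rough numbers on it, assuming a typical 90 Hz HMD:
>>>>>>>>>>>>>>>> the frame budget is 1000 ms / 90 ~ 11.1 ms, so the
>>>>>>>>>>>>>>>> reprojection job plus all of its scheduling latency has to
>>>>>>>>>>>>>>>> fit into whatever fraction of that window remains once the
>>>>>>>>>>>>>>>> compositor decides the game will miss the VBLANK.)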
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>>> following document:
>>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>>>>>       submission to fence signal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>>>> hardware should be equivalent to submitting a NOP on idle
>>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>>> capabilities in Polaris10 we will not be able to provide a
>>>>>>>>>>>>>>>> solution compatible with GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an
>>>>>>>>>>>>>>>> idea, approach or suggestion that will also be compatible
>>>>>>>>>>>>>>>> with the GFX ring, please let us know about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>>>>>       amdkfd workloads
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>>>> necessary as users running games are not traditionally
>>>>>>>>>>>>>>>> running HPC workloads in the background.
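>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As a rough sketch of the NOP round-trip test described
>>>>>>>>>>>>>>>> above, using the libdrm amdgpu API (error handling and
>>>>>>>>>>>>>>>> cleanup are omitted, GFX_COMPUTE_NOP is the packet value
>>>>>>>>>>>>>>>> used by the libdrm amdgpu tests, and the ring index /
>>>>>>>>>>>>>>>> render node path are illustrative):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     #include <amdgpu.h>
>>>>>>>>>>>>>>>>     #include <amdgpu_drm.h>
>>>>>>>>>>>>>>>>     #include <fcntl.h>
>>>>>>>>>>>>>>>>     #include <stdio.h>
>>>>>>>>>>>>>>>>     #include <time.h>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     #define GFX_COMPUTE_NOP 0xffff1000 /* PM4 NOP, as in libdrm's tests */
>>>>>>>>>>>>>>>>     #define IB_DWORDS 16
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     int main(void)
>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>         amdgpu_device_handle dev;
>>>>>>>>>>>>>>>>         amdgpu_context_handle ctx;
>>>>>>>>>>>>>>>>         amdgpu_bo_handle bo;
>>>>>>>>>>>>>>>>         amdgpu_bo_list_handle bo_list;
>>>>>>>>>>>>>>>>         amdgpu_va_handle va_handle;
>>>>>>>>>>>>>>>>         struct amdgpu_bo_alloc_request req = { 0 };
>>>>>>>>>>>>>>>>         struct amdgpu_cs_ib_info ib_info = { 0 };
>>>>>>>>>>>>>>>>         struct amdgpu_cs_request ibs_req = { 0 };
>>>>>>>>>>>>>>>>         struct amdgpu_cs_fence fence = { 0 };
>>>>>>>>>>>>>>>>         struct timespec t0, t1;
>>>>>>>>>>>>>>>>         uint32_t major, minor, *cpu, expired, i;
>>>>>>>>>>>>>>>>         uint64_t va;
>>>>>>>>>>>>>>>>         int fd = open("/dev/dri/renderD128", O_RDWR);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         amdgpu_device_initialize(fd, &major, &minor, &dev);
>>>>>>>>>>>>>>>>         amdgpu_cs_ctx_create(dev, &ctx);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         /* Allocate, map and fill a small IB with NOPs. */
>>>>>>>>>>>>>>>>         req.alloc_size = 4096;
>>>>>>>>>>>>>>>>         req.phys_alignment = 4096;
>>>>>>>>>>>>>>>>         req.preferred_heap = AMDGPU_GEM_DOMAIN_GTT;
>>>>>>>>>>>>>>>>         amdgpu_bo_alloc(dev, &req, &bo);
>>>>>>>>>>>>>>>>         amdgpu_va_range_alloc(dev, amdgpu_gpu_va_range_general,
>>>>>>>>>>>>>>>>                               4096, 4096, 0, &va, &va_handle, 0);
>>>>>>>>>>>>>>>>         amdgpu_bo_va_op(bo, 0, 4096, va, 0, AMDGPU_VA_OP_MAP);
>>>>>>>>>>>>>>>>         amdgpu_bo_cpu_map(bo, (void **)&cpu);
>>>>>>>>>>>>>>>>         for (i = 0; i < IB_DWORDS; i++)
>>>>>>>>>>>>>>>>             cpu[i] = GFX_COMPUTE_NOP;
>>>>>>>>>>>>>>>>         amdgpu_bo_list_create(dev, 1, &bo, NULL, &bo_list);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         ib_info.ib_mc_address = va;
>>>>>>>>>>>>>>>>         ib_info.size = IB_DWORDS;        /* in dwords */
>>>>>>>>>>>>>>>>         ibs_req.ip_type = AMDGPU_HW_IP_COMPUTE;
>>>>>>>>>>>>>>>>         ibs_req.ring = 0;     /* hypothetically, the high priority ring */
>>>>>>>>>>>>>>>>         ibs_req.number_of_ibs = 1;
>>>>>>>>>>>>>>>>         ibs_req.ibs = &ib_info;
>>>>>>>>>>>>>>>>         ibs_req.resources = bo_list;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         /* Measure the submission -> fence-signal round trip. */
>>>>>>>>>>>>>>>>         clock_gettime(CLOCK_MONOTONIC, &t0);
>>>>>>>>>>>>>>>>         amdgpu_cs_submit(ctx, 0, &ibs_req, 1);
>>>>>>>>>>>>>>>>         fence.context = ctx;
>>>>>>>>>>>>>>>>         fence.ip_type = AMDGPU_HW_IP_COMPUTE;
>>>>>>>>>>>>>>>>         fence.fence = ibs_req.seq_no;
>>>>>>>>>>>>>>>>         amdgpu_cs_query_fence_status(&fence, AMDGPU_TIMEOUT_INFINITE,
>>>>>>>>>>>>>>>>                                      0, &expired);
>>>>>>>>>>>>>>>>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         printf("round trip: %ld us\n",
>>>>>>>>>>>>>>>>                (t1.tv_sec - t0.tv_sec) * 1000000 +
>>>>>>>>>>>>>>>>                (t1.tv_nsec - t0.tv_nsec) / 1000);
>>>>>>>>>>>>>>>>         return 0;
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Running this once on an idle system and again while a heavy
>>>>>>>>>>>>>>>> game/compute load is active should expose exactly the
>>>>>>>>>>>>>>>> variance we are trying to eliminate.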
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the Windows driver, we could expose a high
>>>>>>>>>>>>>>>> priority compute queue to userspace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with
>>>>>>>>>>>>>>>> high priority, and may acquire hardware resources
>>>>>>>>>>>>>>>> previously in use by other queues.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>>> field in the HQDs and could be programmed by amdgpu or the
>>>>>>>>>>>>>>>> amdgpu scheduler. The relevant register fields are:
>>>>>>>>>>>>>>>>     * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>>     * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>>> pipe0. We can statically partition these as follows:
>>>>>>>>>>>>>>>>     * 7x regular
>>>>>>>>>>>>>>>>     * 1x high priority
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>>> the high priority ring will starve the other compute rings
>>>>>>>>>>>>>>>> and the GFX ring.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>>> priority rings if the context is marked as high priority.
>>>>>>>>>>>>>>>> And a corresponding priority should be added to keep track
>>>>>>>>>>>>>>>> of this information:
>>>>>>>>>>>>>>>>     * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>>     * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>>     * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>>>>> appropriate flag in drm_amdgpu_ctx_in
>>>>>>>>>>>>>>>> (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>>>       submissions to a context
>>>>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>>>       in the same process
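>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A sketch of what the ring selection for approach 1 could
>>>>>>>>>>>>>>>> look like (the ring indices and the ctx->priority plumbing
>>>>>>>>>>>>>>>> are illustrative, not final):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     /* Illustrative: steer jobs from a high priority
>>>>>>>>>>>>>>>>      * context to the dedicated high priority compute ring,
>>>>>>>>>>>>>>>>      * everything else to the 7 regular rings. */
>>>>>>>>>>>>>>>>     #define COMPUTE_RING_HIGH_PRIO  7
>>>>>>>>>>>>>>>>     #define NUM_REGULAR_RINGS       7
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     static struct amdgpu_ring *
>>>>>>>>>>>>>>>>     amdgpu_ctx_select_compute_ring(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>>                                    struct amdgpu_ctx *ctx,
>>>>>>>>>>>>>>>>                                    uint32_t ring_idx)
>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>             if (ctx->priority == AMD_SCHED_PRIORITY_HIGH)
>>>>>>>>>>>>>>>>                     return &adev->gfx.compute_ring[COMPUTE_RING_HIGH_PRIO];
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             return &adev->gfx.compute_ring[ring_idx % NUM_REGULAR_RINGS];
>>>>>>>>>>>>>>>>     }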
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>>> priorities at amdgpu_init() time, the SW scheduler will
>>>>>>>>>>>>>>>> reprogram the queue priorities dynamically when scheduling
>>>>>>>>>>>>>>>> a task.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This would involve having a hardware specific callback from
>>>>>>>>>>>>>>>> the scheduler to set the appropriate queue priority:
>>>>>>>>>>>>>>>> set_priority(int ring, int index, int priority)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>>> to perform the appropriate HW programming, and I'm not
>>>>>>>>>>>>>>>> really sure if that is something we should be doing from
>>>>>>>>>>>>>>>> the scheduler.
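>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For illustration, the callback could look roughly like this
>>>>>>>>>>>>>>>> (a simplified sketch for VI/gfx8, using the
>>>>>>>>>>>>>>>> mmCP_HQD_*_PRIORITY fields mentioned above; not a final
>>>>>>>>>>>>>>>> implementation):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     /* Sketch: select the queue behind the ring and
>>>>>>>>>>>>>>>>      * reprogram its HQD priority registers under the
>>>>>>>>>>>>>>>>      * SRBM mutex. */
>>>>>>>>>>>>>>>>     static void gfx_v8_0_set_priority(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>>                                       struct amdgpu_ring *ring,
>>>>>>>>>>>>>>>>                                       int priority)
>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>             mutex_lock(&adev->srbm_mutex);
>>>>>>>>>>>>>>>>             vi_srbm_select(adev, ring->me, ring->pipe,
>>>>>>>>>>>>>>>>                            ring->queue, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
>>>>>>>>>>>>>>>>             WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             vi_srbm_select(adev, 0, 0, 0, 0);
>>>>>>>>>>>>>>>>             mutex_unlock(&adev->srbm_mutex);
>>>>>>>>>>>>>>>>     }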
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>>> program a range of priorities for jobs instead of a single
>>>>>>>>>>>>>>>> "high priority" value, achieving something similar to the
>>>>>>>>>>>>>>>> niceness API available for CPU scheduling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>>>>> need for our use case, but it might be useful in other
>>>>>>>>>>>>>>>> scenarios (multiple users sharing compute time on a
>>>>>>>>>>>>>>>> server).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or repurposing of the flags field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>>> priorities, and instead it picks jobs at random. Settings
>>>>>>>>>>>>>>>> from the shader itself are also disregarded as this is
>>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>>>>> but we might not get the time we need on the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>>> priority propagation from the HQD into the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>>> enabled for all HW IPs with support of the SW scheduler.
>>>>>>>>>>>>>>>> This will function similarly to the current
>>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>>> ahead of anything not committed to the HW queue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>>> non-compute queue will be lesser (e.g. up to 10s of wait
>>>>>>>>>>>>>>>> time if a GFX command is stuck in front of you), but having
>>>>>>>>>>>>>>>> the API in place will allow us to easily improve the
>>>>>>>>>>>>>>>> implementation in the future as new features become
>>>>>>>>>>>>>>>> available in new hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>>> thinking about exposing the high priority queue through
>>>>>>>>>>>>>>>> radv.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>>> Our goal is to obtain a mechanism that will allow us to
>>>>>>>>>>>>>>>> complete the reprojection job within a predictable amount
>>>>>>>>>>>>>>>> of time. So if anyone has any suggestions for improvements
>>>>>>>>>>>>>>>> or alternative strategies we are more than happy to hear
>>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If any of the technical information above is incorrect,
>>>>>>>>>>>>>>>> feel free to point out my misunderstandings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx