By the way, are you using the all-open driver or the amdgpu-pro driver?

+David Mao, who is working on our Vulkan driver.

Regards,
David Zhou

On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
> Hi Serguei,
>
> I'm also working on bringing up our VR runtime on top of amdgpu; see replies inline.
>
> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>> Andres,
>>
>>> For current VR workloads we have 3 separate processes running actually:
>> So we could have a potential memory overcommit case, or do you do partitioning on your own? I would think that there is a need to avoid overcommit in the VR case to prevent any BO migration.
>
> You're entirely correct; currently the VR runtime is setting up prioritized CPU scheduling for its VR compositor, we're working on prioritized GPU scheduling and pre-emption (e.g. this thread), and in the future it will make sense to do work to make sure that its memory allocations do not get evicted, to prevent any unwelcome additional latency in the event of needing to perform just-in-time reprojection.
>
>> BTW: Do you mean __real__ processes or threads?
>> Based on my understanding, sharing BOs between different processes could introduce additional synchronization constraints. BTW: I am not sure if we are able to share Vulkan sync objects across process boundaries.
>
> They are different processes; it is important for the compositor that is responsible for quality-of-service features such as consistently presenting distorted frames with the right latency, reprojection, etc., to be separate from the main application.
>
> Currently we are using unreleased cross-process memory and semaphore extensions to fetch updated eye images from the client application, but the just-in-time reprojection discussed here does not actually have any direct interactions with cross-process resource sharing, since it's achieved by using whatever is the latest, most up-to-date eye images that have already been sent by the client application, which are already available to use without additional synchronization.
>
>>
>>> 3) System compositor (we are looking at approaches to remove this overhead)
>> Yes, IMHO the best is to run in "full screen mode".
>
> Yes, we are working on mechanisms to present directly to the headset display without any intermediaries as a separate effort.
>
>>
>>> The latency is our main concern,
>> I would assume that this is a known problem (at least for compute usage).
>> It looks like amdgpu / kernel submission is rather CPU intensive (at least in the default configuration).
>
> As long as it's a consistent cost, it shouldn't be an issue. However, if there are high degrees of variance then that would be troublesome and we would need to account for the worst case.
>
> Hopefully the requirements and approach we described make sense; we're looking forward to your feedback and suggestions.
>
> Thanks!
> - Pierre-Loup
>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 10:00 PM
>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hey Serguei,
>>
>>> [Serguei] No. I mean pipe :-) as MEC defines it. As far as I understand (by simplifying), some scheduling is per pipe. I know about the current allocation scheme but I do not think that it is ideal.
>>> I would assume that we need to switch to dynamic partitioning of resources based on the workload, otherwise we will have a resource conflict between Vulkan compute and OpenCL.
>>
>> I agree the partitioning isn't ideal. I'm hoping we can start with a solution that assumes that only pipe0 has any work and the other pipes are idle (no HSA/ROCm running on the system).
>>
>> This should be more or less the use case we expect from VR users.
>>
>> I agree the split is currently not ideal, but I'd like to consider that a separate task, because making it dynamic is not straightforward :P
>>
>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be involved. I would assume that in the case of VR we will have one main application ("console" mode(?)) so we could temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>
>> Correct, this is why we want to enable the high priority compute queue through libdrm-amdgpu, so that we can expose it through Vulkan later.
>>
>> For current VR workloads we have 3 separate processes running actually:
>>     1) Game process
>>     2) VR Compositor (this is the process that will require the high priority queue)
>>     3) System compositor (we are looking at approaches to remove this overhead)
>>
>> For now I think it is okay to assume no OpenCL/ROCm running simultaneously, but I would also like to be able to address this case in the future (cross-pipe priorities).
>>
>>> [Serguei] The problem with pre-emption of graphics task: (a) it may take time so latency may suffer
>>
>> The latency is our main concern, we want something that is predictable.
>> A good illustration of what the reprojection scheduling looks like can be found here:
>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>
>>> (b) to preempt we need to have different "context" - we want to guarantee that submissions from the same context will be executed in order.
>>
>> This is okay, as the reprojection work doesn't have dependencies on the game context, and it even happens in a separate process.
>>
>>> BTW: (a) Do you want "preempt" and later resume or do you want "preempt" and "cancel/abort"
>>
>> Preempt the game with the compositor task and then resume it.
>>
>>> (b) Vulkan is generic API and could be used for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>> Yeah, the plan is to use vulkan compute. But if you figure out a way for us to get a guaranteed execution time using vulkan graphics, then I'll take you out for a beer :)
>>
>> Regards,
>> Andres
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>> Sent: Friday, December 16, 2016 9:13 PM
>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Andres,
>>
>> Please see inline (as [Serguei])
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 8:29 PM
>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Serguei,
>>
>> Thanks for the feedback. Answers inline as [AR].
>>
>> Regards,
>> Andres
>>
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>> Sent: Friday, December 16, 2016 8:15 PM
>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Andres,
>>
>>
>> Quick comments:
>>
>> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding to the high-priority queue when it is in use and "free" them later (we do not want to take CUs away from e.g. a graphics task forever and degrade graphics performance).
>>
>> Otherwise we could have a scenario where a long graphics task (or low-priority compute) takes all (extra) CUs and the high-priority queue waits for the needed resources. This will not be visible with "NOP" packets, only when you submit a "real" compute task, so I would recommend not using "NOP" packets at all for testing.
>>
>> It (CU assignment) could be done relatively easily when everything is going via the kernel (e.g. as part of frame submission), but I must admit that I am not sure about the best way for user level submissions (amdkfd).
>>
>> [AR] I wasn't aware of this part of the programming sequence. Thanks for the heads up! Is this similar to the CU masking programming?
>>
>> [Serguei] Yes. To simplify: the problem is that the "scheduler", when deciding which queue to run, will check if there are enough resources and if not it will begin to check other queues with lower priority.
>>
>> 2) I would recommend dedicating the whole pipe to the high-priority queue and having nothing there except it.
>>
>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed to the MEC definition of pipe, which is a grouping of queues). I say this because amdgpu only has access to 1 pipe, and the rest are statically partitioned for amdkfd usage.
>>
>> [Serguei] No. I mean pipe :-) as MEC defines it.
>> As far as I understand (by simplifying), some scheduling is per pipe. I know about the current allocation scheme but I do not think that it is ideal. I would assume that we need to switch to dynamic partitioning of resources based on the workload, otherwise we will have a resource conflict between Vulkan compute and OpenCL.
>>
>>
>> BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?
>>
>> [AR] Vulkan
>>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be involved. I would assume that in the case of VR we will have one main application ("console" mode(?)) so we could temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>
>>> we will not be able to provide a solution compatible with GFX workloads.
>> I assume that you are talking about graphics? Am I right?
>>
>> [AR] Yeah, my understanding is that pre-empting the currently running graphics job and scheduling in something else using mid-buffer pre-emption has some cases where it doesn't work well. But if with polaris10 it starts working well, it might be a better solution for us (because the whole reprojection work uses the vulkan graphics stack at the moment, and porting it to compute is not trivial).
>>
>> [Serguei] The problem with pre-emption of graphics task: (a) it may take time so latency may suffer (b) to preempt we need to have different "context" - we want to guarantee that submissions from the same context will be executed in order.
>> BTW: (a) Do you want "preempt" and later resume or do you want "preempt" and "cancel/abort"? (b) Vulkan is generic API and could be used for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>>
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 6:15 PM
>> To: amd-gfx at lists.freedesktop.org
>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Everyone,
>>
>> This RFC is also available as a gist here:
>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>
>> We are interested in feedback for a mechanism to effectively schedule high priority VR reprojection tasks (also referred to as time-warping) for Polaris10 running on the amdgpu kernel driver.
>>
>> Brief context:
>> --------------
>>
>> The main objective of reprojection is to avoid motion sickness for VR users in scenarios where the game or application would fail to finish rendering a new frame in time for the next VBLANK. When this happens, the user's head movements are not reflected on the Head Mounted Display (HMD) for the duration of an extra frame. This extended mismatch between the inner ear and the eyes may cause the user to experience motion sickness.
>>
>> The VR compositor deals with this problem by fabricating a new frame using the user's updated head position in combination with the previous frames. This avoids a prolonged mismatch between the HMD output and the inner ear.
>>
>> Because of the adverse effects on the user, we require high confidence that the reprojection task will complete before the VBLANK interval, even if the GFX pipe is currently full of work from the game/application (which is most likely the case).
>>
>> For more details and illustrations, please refer to the following document:
>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>
>> Requirements:
>> -------------
>>
>> The mechanism must expose the following functionality:
>>
>>     * Job round trip time must be predictable, from submission to fence signal
>>
>>     * The mechanism must support compute workloads.
>>
>> Goals:
>> ------
>>
>>     * The mechanism should provide low submission latencies
>>
>> Test: submitting a NOP packet through the mechanism on busy hardware should be equivalent to submitting a NOP on idle hardware.
>>
>> Nice to have:
>> -------------
>>
>>     * The mechanism should also support GFX workloads.
>>
>> My understanding is that with the current hardware capabilities in Polaris10 we will not be able to provide a solution compatible with GFX workloads.
>>
>> But I would love to hear otherwise.
>> So if anyone has an idea, approach or suggestion that will also be compatible with the GFX ring, please let us know about it.
>>
>>     * The above guarantees should also be respected by amdkfd workloads
>>
>> Would be good to have for consistency, but not strictly necessary as users running games are not traditionally running HPC workloads in the background.
>>
>> Proposed approach:
>> ------------------
>>
>> Similar to the windows driver, we could expose a high priority compute queue to userspace.
>>
>> Submissions to this compute queue will be scheduled with high priority, and may acquire hardware resources previously in use by other queues.
>>
>> This can be achieved by taking advantage of the 'priority' field in the HQDs and could be programmed by amdgpu or the amdgpu scheduler. The relevant register fields are:
>>     * mmCP_HQD_PIPE_PRIORITY
>>     * mmCP_HQD_QUEUE_PRIORITY
>>
>> Implementation approach 1 - static partitioning:
>> ------------------------------------------------
>>
>> The amdgpu driver currently controls 8 compute queues from pipe0. We can statically partition these as follows:
>>     * 7x regular
>>     * 1x high priority
>>
>> The relevant priorities can be set so that submissions to the high priority ring will starve the other compute rings and the GFX ring.
>>
>> The amdgpu scheduler will only place jobs into the high priority rings if the context is marked as high priority.
>> And a corresponding priority should be added to keep track of this information:
>>     * AMD_SCHED_PRIORITY_KERNEL
>>     * -> AMD_SCHED_PRIORITY_HIGH
>>     * AMD_SCHED_PRIORITY_NORMAL
>>
>> The user will request a high priority context by setting an appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>
>> The setting is at a per-context level so that we can:
>>     * Maintain a consistent FIFO ordering of all submissions to a context
>>     * Create high priority and non-high priority contexts in the same process
>>
>> Implementation approach 2 - dynamic priority programming:
>> ---------------------------------------------------------
>>
>> Similar to the above, but instead of programming the priorities at amdgpu_init() time, the SW scheduler will reprogram the queue priorities dynamically when scheduling a task.
>>
>> This would involve having a hardware specific callback from the scheduler to set the appropriate queue priority: set_priority(int ring, int index, int priority)
>>
>> During this callback we would have to grab the SRBM mutex to perform the appropriate HW programming, and I'm not really sure if that is something we should be doing from the scheduler.
>>
>> On the positive side, this approach would allow us to program a range of priorities for jobs instead of a single "high priority" value, achieving something similar to the niceness API available for CPU scheduling.
>>
>> I'm not sure if this flexibility is something that we would need for our use case, but it might be useful in other scenarios (multiple users sharing compute time on a server).
>>
>> This approach would require a new int field in drm_amdgpu_ctx_in, or repurposing of the flags field.
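
The two context-level interfaces discussed in the RFC (a single high-priority flag for approach 1, an integer priority for approach 2) could be sketched roughly as below. This is only a minimal illustration under stated assumptions, not the amdgpu implementation: every identifier prefixed with example_ or EXAMPLE_ is hypothetical, the flag value is a placeholder, and the niceness-like bounds are borrowed from CPU nice values purely for the sake of the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit. The RFC only proposes "AMDGPU_CTX_HIGH_PRIORITY or
 * similar"; this name and value are placeholders, not real UAPI. */
#define EXAMPLE_CTX_FLAG_HIGH_PRIORITY (1u << 0)

/* Illustrative priority levels, in the order the RFC lists them.
 * In this sketch a smaller value is served first. */
enum example_priority {
    EXAMPLE_PRIORITY_KERNEL = 0, /* driver-internal work */
    EXAMPLE_PRIORITY_HIGH   = 1, /* the proposed new level */
    EXAMPLE_PRIORITY_NORMAL = 2, /* default for userspace contexts */
};

/* Approach 1: one flag, fixed at context creation, selects the level.
 * Userspace can never reach EXAMPLE_PRIORITY_KERNEL through this path. */
static enum example_priority example_priority_from_flags(uint32_t flags)
{
    if (flags & EXAMPLE_CTX_FLAG_HIGH_PRIORITY)
        return EXAMPLE_PRIORITY_HIGH;
    return EXAMPLE_PRIORITY_NORMAL;
}

/* Approach 2: an integer request in a niceness-like range.
 * The bounds are an assumption made for illustration only. */
#define EXAMPLE_NICE_MIN (-20)
#define EXAMPLE_NICE_MAX 19

/* Clamp an arbitrary userspace request into the supported range. */
static int example_clamp_priority(int requested)
{
    if (requested < EXAMPLE_NICE_MIN)
        return EXAMPLE_NICE_MIN;
    if (requested > EXAMPLE_NICE_MAX)
        return EXAMPLE_NICE_MAX;
    return requested;
}
```

Because the priority is derived once per context, all submissions to that context keep a single FIFO ordering, and a process remains free to create both kinds of context side by side.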
>>
>> Known current obstacles:
>> ------------------------
>>
>> The SQ is currently programmed to disregard the HQD priorities, and instead it picks jobs at random. Settings from the shader itself are also disregarded as this is considered a privileged field.
>>
>> Effectively we can get our compute wavefront launched ASAP, but we might not get the time we need on the SQ.
>>
>> The current programming would have to be changed to allow priority propagation from the HQD into the SQ.
>>
>> Generic approach for all HW IPs:
>> --------------------------------
>>
>> For consistency purposes, the high priority context can be enabled for all HW IPs with support of the SW scheduler. This will function similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of anything not committed to the HW queue.
>>
>> The benefits of requesting a high priority context for a non-compute queue will be lesser (e.g. up to 10s of wait time if a GFX command is stuck in front of you), but having the API in place will allow us to easily improve the implementation in the future as new features become available in new hardware.
>>
>> Future steps:
>> -------------
>>
>> Once we have an approach settled, I can take care of the implementation.
>>
>> Also, once the interface is mostly decided, we can start thinking about exposing the high priority queue through radv.
>>
>> Request for feedback:
>> ---------------------
>>
>> We aren't married to any of the approaches outlined above. Our goal is to obtain a mechanism that will allow us to complete the reprojection job within a predictable amount of time. So if anyone has any suggestions for improvements or alternative strategies we are more than happy to hear them.
>>
>> If any of the technical information above is also incorrect, feel free to point out my misunderstandings.
>>
>> Looking forward to hearing from you.
>>
>> Regards,
>> Andres
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx