On December 19, 2016 11:33, Pierre-Loup A. Griffais wrote:
> We're currently working with the open stack; I assume that a mechanism
> could be exposed by both open and Pro Vulkan userspace drivers and
> that the amdgpu kernel interface improvements we would pursue
> following this discussion would let both drivers take advantage of the
> feature, correct?

Of course.
Does the open stack have Vulkan support?

Regards,
David Zhou
>
> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>> By the way, are you using the all-open driver or the amdgpu-pro
>> driver?
>>
>> +David Mao, who is working on our Vulkan driver.
>>
>> Regards,
>> David Zhou
>>
>> On December 18, 2016 06:05, Pierre-Loup A. Griffais wrote:
>>> Hi Serguei,
>>>
>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>> see replies inline.
>>>
>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>> Andres,
>>>>
>>>>> For current VR workloads we actually have 3 separate processes
>>>>> running:
>>>> So we could have a potential memory overcommit case, or do you do
>>>> partitioning on your own? I would think that there is a need to
>>>> avoid overcommit in the VR case to prevent any BO migration.
>>>
>>> You're entirely correct; currently the VR runtime is setting up
>>> prioritized CPU scheduling for its VR compositor; we're working on
>>> prioritized GPU scheduling and pre-emption (e.g. this thread), and in
>>> the future it will make sense to do work in order to make sure that
>>> its memory allocations do not get evicted, to prevent any unwelcome
>>> additional latency in the event of needing to perform just-in-time
>>> reprojection.
>>>
>>>> BTW: Do you mean __real__ processes or threads?
>>>> Based on my understanding, sharing BOs between different processes
>>>> could introduce additional synchronization constraints. BTW: I am
>>>> not sure if we are able to share Vulkan sync objects across the
>>>> process boundary.
>>>
>>> They are different processes; it is important for the compositor that
>>> is responsible for quality-of-service features, such as consistently
>>> presenting distorted frames with the right latency, reprojection,
>>> etc., to be separate from the main application.
>>>
>>> Currently we are using unreleased cross-process memory and semaphore
>>> extensions to fetch updated eye images from the client application,
>>> but the just-in-time reprojection discussed here does not actually
>>> have any direct interactions with cross-process resource sharing,
>>> since it's achieved by using whatever are the latest, most up-to-date
>>> eye images that have already been sent by the client application,
>>> which are already available to use without additional
>>> synchronization.
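>>>
>>> To give a rough idea of the shape that semaphore sharing takes: the
>>> extensions we use are unreleased, so this sketch borrows the
>>> VK_KHR_external_semaphore_fd names as a stand-in, and the semaphore
>>> handles (eye_image_ready, compositor_copy) are illustrative:
>>>
>>>     /* Exporting side (client application): obtain an fd for a
>>>      * semaphore so the compositor can wait on it. */
>>>     VkSemaphoreGetFdInfoKHR get_info = {
>>>         .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
>>>         .semaphore = eye_image_ready,
>>>         .handleType =
>>>             VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT_KHR,
>>>     };
>>>     int fd = -1;
>>>     vkGetSemaphoreFdKHR(device, &get_info, &fd);
>>>     /* ...pass fd to the compositor over a unix domain socket... */
>>>
>>>     /* Importing side (VR compositor): */
>>>     VkImportSemaphoreFdInfoKHR import_info = {
>>>         .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
>>>         .semaphore = compositor_copy,
>>>         .handleType =
>>>             VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT_KHR,
>>>         .fd = fd,
>>>     };
>>>     vkImportSemaphoreFdKHR(device, &import_info);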
>>>
>>>>
>>>>> 3) System compositor (we are looking at approaches to remove this
>>>>> overhead)
>>>> Yes, IMHO the best is to run in "full screen mode".
>>>
>>> Yes, we are working on mechanisms to present directly to the headset
>>> display without any intermediaries as a separate effort.
>>>
>>>>> The latency is our main concern,
>>>> I would assume that this is the known problem (at least for compute
>>>> usage). It looks like amdgpu / kernel submission is rather CPU
>>>> intensive (at least in the default configuration).
>>>
>>> As long as it's a consistent cost, it shouldn't be an issue. However,
>>> if there are high degrees of variance then that would be troublesome
>>> and we would need to account for the worst case.
>>>
>>> Hopefully the requirements and approach we described make sense;
>>> we're looking forward to your feedback and suggestions.
>>>
>>> Thanks!
>>>  - Pierre-Loup
>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>> Sent: December 16, 2016 10:00 PM
>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hey Serguei,
>>>>
>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>> understand (by simplifying), some scheduling is per pipe. I know
>>>>> about the current allocation scheme, but I do not think that it is
>>>>> ideal. I would assume that we need to switch to dynamic
>>>>> partitioning of resources based on the workload, otherwise we will
>>>>> have resource conflicts between Vulkan compute and OpenCL.
>>>>
>>>> I agree the partitioning isn't ideal. I'm hoping we can start with a
>>>> solution that assumes that only pipe0 has any work and the other
>>>> pipes are idle (no HSA/ROCm running on the system).
>>>>
>>>> This should be more or less the use case we expect from VR users.
>>>>
>>>> I agree the split is currently not ideal, but I'd like to consider
>>>> that a separate task, because making it dynamic is not
>>>> straightforward :P
>>>>
>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>> will not be involved. I would assume that in the case of VR we will
>>>>> have one main application ("console" mode(?)) so we could
>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>
>>>> Correct, this is why we want to enable the high priority compute
>>>> queue through libdrm-amdgpu, so that we can expose it through Vulkan
>>>> later.
>>>>
>>>> For current VR workloads we actually have 3 separate processes
>>>> running:
>>>>   1) Game process
>>>>   2) VR Compositor (this is the process that will require the high
>>>>      priority queue)
>>>>   3) System compositor (we are looking at approaches to remove this
>>>>      overhead)
>>>>
>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>> simultaneously, but I would also like to be able to address this
>>>> case in the future (cross-pipe priorities).
>>>>
>>>>> [Serguei] The problem with pre-emption of a graphics task: (a) it
>>>>> may take time, so latency may suffer.
>>>>
>>>> The latency is our main concern; we want something that is
>>>> predictable. A good illustration of what the reprojection scheduling
>>>> looks like can be found here:
>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>
>>>>> (b) to preempt we need to have a different "context" - we want
>>>>> to guarantee that submissions from the same context will be
>>>>> executed in order.
>>>>
>>>> This is okay, as the reprojection work doesn't have dependencies on
>>>> the game context, and it even happens in a separate process.
>>>>
>>>>> BTW: (a) Do you want to "preempt" and later resume, or do you want
>>>>> to "preempt" and "cancel/abort"?
>>>>
>>>> Preempt the game with the compositor task and then resume it.
>>>>
>>>>> (b) Vulkan is a generic API and could be used for graphics as well
>>>>> as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>
>>>> Yeah, the plan is to use Vulkan compute. But if you figure out a way
>>>> for us to get a guaranteed execution time using Vulkan graphics,
>>>> then I'll take you out for a beer :)
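>>>>
>>>> For reference, a minimal sketch of picking a compute-only queue
>>>> family so the work lands on the async compute rings rather than the
>>>> GFX ring (no claims about how our actual code does it):
>>>>
>>>>     #include <vulkan/vulkan.h>
>>>>
>>>>     /* Return the first queue family that supports compute but not
>>>>      * graphics, or UINT32_MAX if there is none. */
>>>>     static uint32_t
>>>>     find_dedicated_compute_family(VkPhysicalDevice phys_dev)
>>>>     {
>>>>         uint32_t count = 0;
>>>>         vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count,
>>>>                                                  NULL);
>>>>         VkQueueFamilyProperties props[count];
>>>>         vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count,
>>>>                                                  props);
>>>>
>>>>         for (uint32_t i = 0; i < count; i++) {
>>>>             VkQueueFlags f = props[i].queueFlags;
>>>>             if ((f & VK_QUEUE_COMPUTE_BIT) &&
>>>>                 !(f & VK_QUEUE_GRAPHICS_BIT))
>>>>                 return i; /* dedicated compute family */
>>>>         }
>>>>         return UINT32_MAX;
>>>>     }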
>>>>
>>>> Regards,
>>>> Andres
>>>> ________________________________________
>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Andres,
>>>>
>>>> Please see inline (as [Serguei])
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>> Sent: December 16, 2016 8:29 PM
>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Serguei,
>>>>
>>>> Thanks for the feedback. Answers inline as [AR].
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> ________________________________________
>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Andres,
>>>>
>>>> Quick comments:
>>>>
>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>> assignments/binding to the high-priority queue when it will be in
>>>> use, and "free" them later (we do not want to take CUs away from
>>>> e.g. a graphics task forever and degrade graphics performance).
>>>>
>>>> Otherwise we could have a scenario where a long graphics task (or
>>>> low-priority compute) takes all (extra) CUs and the high-priority
>>>> queue waits for the needed resources. It will not be visible with
>>>> "NOP" packets, but only when you submit a "real" compute task, so I
>>>> would recommend not using "NOP" packets at all for testing.
>>>>
>>>> It (CU assignment) could be relatively easily done when everything
>>>> is going via the kernel (e.g. as part of frame submission), but I
>>>> must admit that I am not sure about the best way for user level
>>>> submissions (amdkfd).
>>>>
>>>> [AR] I wasn't aware of this part of the programming sequence. Thanks
>>>> for the heads up!
>>>> Is this similar to the CU masking programming?
>>>>
>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler",
>>>> when deciding which queue to run, will check if there are enough
>>>> resources, and if not then it will begin to check other queues with
>>>> lower priority.
>>>>
>>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>>> queue and having nothing there except it.
>>>>
>>>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed
>>>> to the MEC definition of pipe, which is a grouping of queues). I say
>>>> this because amdgpu only has access to 1 pipe, and the rest are
>>>> statically partitioned for amdkfd usage.
>>>>
>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>> understand (by simplifying), some scheduling is per pipe. I know
>>>> about the current allocation scheme, but I do not think that it is
>>>> ideal. I would assume that we need to switch to dynamic partitioning
>>>> of resources based on the workload, otherwise we will have resource
>>>> conflicts between Vulkan compute and OpenCL.
>>>>
>>>> BTW: Which user level API do you want to use for compute: Vulkan or
>>>> OpenCL?
>>>>
>>>> [AR] Vulkan
>>>>
>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>> will not be involved.
>>>> I would assume that in the case of VR we will have one main
>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>> OpenCL/ROCm needs when VR is running.
>>>>
>>>>> we will not be able to provide a solution compatible with GFX
>>>>> workloads.
>>>> I assume that you are talking about graphics? Am I right?
>>>>
>>>> [AR] Yeah, my understanding is that pre-empting the currently
>>>> running graphics job and scheduling in something else using
>>>> mid-buffer pre-emption has some cases where it doesn't work well.
>>>> But if it starts working well with Polaris10, it might be a better
>>>> solution for us (because the whole reprojection work uses the Vulkan
>>>> graphics stack at the moment, and porting it to compute is not
>>>> trivial).
>>>>
>>>> [Serguei] The problem with pre-emption of a graphics task: (a) it
>>>> may take time, so latency may suffer; (b) to preempt we need to have
>>>> a different "context" - we want to guarantee that submissions from
>>>> the same context will be executed in order.
>>>> BTW: (a) Do you want to "preempt" and later resume, or do you want
>>>> to "preempt" and "cancel/abort"? (b) Vulkan is a generic API and
>>>> could be used for graphics as well as for plain compute tasks
>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>>>> Andres Rodriguez <andresr at valvesoftware.com>
>>>> Sent: December 16, 2016 6:15 PM
>>>> To: amd-gfx at lists.freedesktop.org
>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Everyone,
>>>>
>>>> This RFC is also available as a gist here:
>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>
>>>> We are interested in feedback for a mechanism to effectively
>>>> schedule high priority VR reprojection tasks (also referred to as
>>>> time-warping) for Polaris10 running on the amdgpu kernel driver.
>>>>
>>>> Brief context:
>>>> --------------
>>>>
>>>> The main objective of reprojection is to avoid motion sickness for
>>>> VR users in scenarios where the game or application would fail to
>>>> finish rendering a new frame in time for the next VBLANK. When this
>>>> happens, the user's head movements are not reflected on the Head
>>>> Mounted Display (HMD) for the duration of an extra frame. This
>>>> extended mismatch between the inner ear and the eyes may cause the
>>>> user to experience motion sickness.
>>>>
>>>> The VR compositor deals with this problem by fabricating a new frame
>>>> using the user's updated head position in combination with the
>>>> previous frames. This avoids a prolonged mismatch between the HMD
>>>> output and the inner ear.
>>>>
>>>> Because of the adverse effects on the user, we require high
>>>> confidence that the reprojection task will complete before the
>>>> VBLANK interval, even if the GFX pipe is currently full of work from
>>>> the game/application (which is most likely the case).
>>>>
>>>> For more details and illustrations, please refer to the following
>>>> document:
>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>
>>>> Requirements:
>>>> -------------
>>>>
>>>> The mechanism must expose the following functionality:
>>>>
>>>>     * Job round trip time must be predictable, from submission to
>>>>       fence signal
>>>>
>>>>     * The mechanism must support compute workloads.
>>>>
>>>> Goals:
>>>> ------
>>>>
>>>>     * The mechanism should provide low submission latencies
>>>>
>>>> Test: submitting a NOP packet through the mechanism on busy hardware
>>>> should be equivalent to submitting a NOP on idle hardware.
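>>>>
>>>> To make the test concrete, a sketch of a measurement loop. Note that
>>>> submit_nop_and_wait() is a hypothetical helper (e.g. one built on
>>>> libdrm's amdgpu_cs_submit() and amdgpu_cs_query_fence_status()), not
>>>> an existing function:
>>>>
>>>>     /* Compare the NOP round-trip distribution on idle vs. busy
>>>>      * hardware; under the goal above they should match. */
>>>>     #include <stdio.h>
>>>>     #include <time.h>
>>>>
>>>>     extern void submit_nop_and_wait(void); /* hypothetical helper */
>>>>
>>>>     static double nop_round_trip_us(void)
>>>>     {
>>>>         struct timespec t0, t1;
>>>>         clock_gettime(CLOCK_MONOTONIC, &t0);
>>>>         submit_nop_and_wait();
>>>>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>         return (t1.tv_sec - t0.tv_sec) * 1e6 +
>>>>                (t1.tv_nsec - t0.tv_nsec) * 1e-3;
>>>>     }
>>>>
>>>>     int main(void)
>>>>     {
>>>>         /* Run once on idle hw, once with the GFX ring saturated. */
>>>>         for (int i = 0; i < 1000; i++)
>>>>             printf("%.2f\n", nop_round_trip_us());
>>>>         return 0;
>>>>     }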
>>>>
>>>> Nice to have:
>>>> -------------
>>>>
>>>>     * The mechanism should also support GFX workloads.
>>>>
>>>> My understanding is that with the current hardware capabilities in
>>>> Polaris10 we will not be able to provide a solution compatible with
>>>> GFX workloads.
>>>>
>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>> approach or suggestion that will also be compatible with the GFX
>>>> ring, please let us know about it.
>>>>
>>>>     * The above guarantees should also be respected by amdkfd
>>>>       workloads
>>>>
>>>> Would be good to have for consistency, but not strictly necessary,
>>>> as users running games are not traditionally running HPC workloads
>>>> in the background.
>>>>
>>>> Proposed approach:
>>>> ------------------
>>>>
>>>> Similar to the Windows driver, we could expose a high priority
>>>> compute queue to userspace.
>>>>
>>>> Submissions to this compute queue will be scheduled with high
>>>> priority, and may acquire hardware resources previously in use by
>>>> other queues.
>>>>
>>>> This can be achieved by taking advantage of the 'priority' field in
>>>> the HQDs and could be programmed by amdgpu or the amdgpu scheduler.
>>>> The relevant register fields are:
>>>>     * mmCP_HQD_PIPE_PRIORITY
>>>>     * mmCP_HQD_QUEUE_PRIORITY
>>>>
>>>> Implementation approach 1 - static partitioning:
>>>> ------------------------------------------------
>>>>
>>>> The amdgpu driver currently controls 8 compute queues from pipe0. We
>>>> can statically partition these as follows:
>>>>     * 7x regular
>>>>     * 1x high priority
>>>>
>>>> The relevant priorities can be set so that submissions to the high
>>>> priority ring will starve the other compute rings and the GFX ring.
>>>>
>>>> The amdgpu scheduler will only place jobs into the high priority
>>>> rings if the context is marked as high priority. And a corresponding
>>>> priority should be added to keep track of this information:
>>>>     * AMD_SCHED_PRIORITY_KERNEL
>>>>     * -> AMD_SCHED_PRIORITY_HIGH
>>>>     * AMD_SCHED_PRIORITY_NORMAL
>>>>
>>>> The user will request a high priority context by setting an
>>>> appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or
>>>> similar):
>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>
>>>> The setting is at a per-context level so that we can:
>>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>>>       context
>>>>     * Create high priority and non-high priority contexts in the
>>>>       same process
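>>>>
>>>> As a sketch of the userspace side of that proposal
>>>> (AMDGPU_CTX_HIGH_PRIORITY is the flag being proposed here; it does
>>>> not exist in amdgpu_drm.h today, and its value is illustrative):
>>>>
>>>>     #include <stdint.h>
>>>>     #include <string.h>
>>>>     #include <sys/ioctl.h>
>>>>     #include <libdrm/amdgpu_drm.h>
>>>>
>>>>     #define AMDGPU_CTX_HIGH_PRIORITY (1 << 0) /* proposed flag */
>>>>
>>>>     static int create_high_priority_ctx(int drm_fd,
>>>>                                         uint32_t *ctx_id)
>>>>     {
>>>>         union drm_amdgpu_ctx args;
>>>>         memset(&args, 0, sizeof(args));
>>>>         args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
>>>>         args.in.flags = AMDGPU_CTX_HIGH_PRIORITY; /* proposed */
>>>>
>>>>         int r = ioctl(drm_fd, DRM_IOCTL_AMDGPU_CTX, &args);
>>>>         if (r == 0)
>>>>             *ctx_id = args.out.alloc.ctx_id;
>>>>         return r;
>>>>     }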
>>>>
>>>> Implementation approach 2 - dynamic priority programming:
>>>> ---------------------------------------------------------
>>>>
>>>> Similar to the above, but instead of programming the priorities at
>>>> amdgpu_init() time, the SW scheduler will reprogram the queue
>>>> priorities dynamically when scheduling a task.
>>>>
>>>> This would involve having a hardware specific callback from the
>>>> scheduler to set the appropriate queue priority:
>>>> set_priority(int ring, int index, int priority)
>>>>
>>>> During this callback we would have to grab the SRBM mutex to perform
>>>> the appropriate HW programming, and I'm not really sure if that is
>>>> something we should be doing from the scheduler.
>>>>
>>>> On the positive side, this approach would allow us to program a
>>>> range of priorities for jobs instead of a single "high priority"
>>>> value, achieving something similar to the niceness API available for
>>>> CPU scheduling.
>>>>
>>>> I'm not sure if this flexibility is something that we would need for
>>>> our use case, but it might be useful in other scenarios (multiple
>>>> users sharing compute time on a server).
>>>>
>>>> This approach would require a new int field in drm_amdgpu_ctx_in, or
>>>> repurposing of the flags field.
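>>>>
>>>> To make the shape of that callback concrete, a rough, untested
>>>> sketch of what a gfx_v8 implementation might look like, modeled on
>>>> the existing HQD programming done under the SRBM mutex (register and
>>>> helper names as used in gfx_v8_0.c):
>>>>
>>>>     /* Rough sketch only; assumes gfx_v8_0.c context. */
>>>>     static void gfx_v8_0_ring_set_priority(
>>>>             struct amdgpu_device *adev,
>>>>             struct amdgpu_ring *ring, u32 priority)
>>>>     {
>>>>         mutex_lock(&adev->srbm_mutex);
>>>>         /* Point SRBM at this ring's HQD before touching its
>>>>          * registers. */
>>>>         vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
>>>>
>>>>         WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
>>>>         WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
>>>>
>>>>         vi_srbm_select(adev, 0, 0, 0, 0);
>>>>         mutex_unlock(&adev->srbm_mutex);
>>>>     }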
>>>>
>>>> Known current obstacles:
>>>> ------------------------
>>>>
>>>> The SQ is currently programmed to disregard the HQD priorities, and
>>>> instead it picks jobs at random. Settings from the shader itself are
>>>> also disregarded, as this is considered a privileged field.
>>>>
>>>> Effectively we can get our compute wavefront launched ASAP, but we
>>>> might not get the time we need on the SQ.
>>>>
>>>> The current programming would have to be changed to allow priority
>>>> propagation from the HQD into the SQ.
>>>>
>>>> Generic approach for all HW IPs:
>>>> --------------------------------
>>>>
>>>> For consistency purposes, the high priority context can be enabled
>>>> for all HW IPs with support of the SW scheduler. This will function
>>>> similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where
>>>> the job can jump ahead of anything not committed to the HW queue.
>>>>
>>>> The benefits of requesting a high priority context for a non-compute
>>>> queue will be lesser (e.g. up to 10s of wait time if a GFX command
>>>> is stuck in front of you), but having the API in place will allow us
>>>> to easily improve the implementation in the future as new features
>>>> become available in new hardware.
>>>>
>>>> Future steps:
>>>> -------------
>>>>
>>>> Once we have an approach settled, I can take care of the
>>>> implementation.
>>>>
>>>> Also, once the interface is mostly decided, we can start thinking
>>>> about exposing the high priority queue through radv.
>>>>
>>>> Request for feedback:
>>>> ---------------------
>>>>
>>>> We aren't married to any of the approaches outlined above. Our goal
>>>> is to obtain a mechanism that will allow us to complete the
>>>> reprojection job within a predictable amount of time. So if anyone
>>>> has any suggestions for improvements or alternative strategies, we
>>>> are more than happy to hear them.
>>>>
>>>> If any of the technical information above is also incorrect, feel
>>>> free to point out my misunderstandings.
>>>>
>>>> Looking forward to hearing from you.
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>