Yes, Vulkan is available on all-open through the Mesa radv UMD. I'm not sure
if I'm asking for too much, but if we can coordinate a similar interface in
radv and amdgpu-pro at the Vulkan level, that would be great.

I'm not sure what that's going to be yet.

- Andres

On 12/19/2016 12:11 AM, zhoucm1 wrote:
>
>
> On December 19, 2016 at 11:33, Pierre-Loup A. Griffais wrote:
>> We're currently working with the open stack; I assume that a
>> mechanism could be exposed by both open and Pro Vulkan userspace
>> drivers and that the amdgpu kernel interface improvements we would
>> pursue following this discussion would let both drivers take
>> advantage of the feature, correct?
> Of course.
> Does the open stack have Vulkan support?
>
> Regards,
> David Zhou
>>
>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>> By the way, are you using the all-open driver or the amdgpu-pro
>>> driver?
>>>
>>> +David Mao, who is working on our Vulkan driver.
>>>
>>> Regards,
>>> David Zhou
>>>
>>> On December 18, 2016 at 06:05, Pierre-Loup A. Griffais wrote:
>>>> Hi Serguei,
>>>>
>>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>>> see replies inline.
>>>>
>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>> Andres,
>>>>>
>>>>>> For current VR workloads we have 3 separate processes running
>>>>>> actually:
>>>>> So we could have a potential memory overcommit case, or do you do
>>>>> partitioning on your own? I would think that there is a need to
>>>>> avoid overcommit in the VR case to prevent any BO migration.
>>>>
>>>> You're entirely correct; currently the VR runtime is setting up
>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>> prioritized GPU scheduling and pre-emption (e.g. this thread), and
>>>> in the future it will make sense to do work to make sure that its
>>>> memory allocations do not get evicted, to prevent any unwelcome
>>>> additional latency in the event of needing to perform just-in-time
>>>> reprojection.
>>>>
>>>>> BTW: Do you mean __real__ processes or threads?
>>>>> Based on my understanding, sharing BOs between different processes
>>>>> could introduce additional synchronization constraints. BTW: I am
>>>>> not sure if we are able to share Vulkan sync objects across the
>>>>> process boundary.
>>>>
>>>> They are different processes; it is important for the compositor
>>>> that is responsible for quality-of-service features such as
>>>> consistently presenting distorted frames with the right latency,
>>>> reprojection, etc., to be separate from the main application.
>>>>
>>>> Currently we are using unreleased cross-process memory and semaphore
>>>> extensions to fetch updated eye images from the client application,
>>>> but the just-in-time reprojection discussed here does not actually
>>>> have any direct interactions with cross-process resource sharing,
>>>> since it's achieved by using whatever are the latest, most
>>>> up-to-date eye images that have already been sent by the client
>>>> application, which are already available to use without additional
>>>> synchronization.
>>>>
>>>>>> 3) System compositor (we are looking at approaches to remove this
>>>>>> overhead)
>>>>> Yes, IMHO the best option is to run in "full screen mode".
>>>>
>>>> Yes, we are working on mechanisms to present directly to the headset
>>>> display without any intermediaries as a separate effort.
>>>>
>>>>>> The latency is our main concern,
>>>>> I would assume that this is a known problem (at least for compute
>>>>> usage).
>>>>> It looks like amdgpu kernel submission is rather CPU intensive (at
>>>>> least in the default configuration).
>>>>
>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>> However, if there are high degrees of variance, then that would be
>>>> troublesome and we would need to account for the worst case.
>>>>
>>>> Hopefully the requirements and approach we described make sense;
>>>> we're looking forward to your feedback and suggestions.
>>>>
>>>> Thanks!
>>>> - Pierre-Loup
>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>>
>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>> Sent: December 16, 2016 10:00 PM
>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hey Serguei,
>>>>>
>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>>> understand (by simplifying), some scheduling is per pipe. I know
>>>>>> about the current allocation scheme, but I do not think that it
>>>>>> is ideal. I would assume that we need to switch to dynamic
>>>>>> partitioning of resources based on the workload, otherwise we
>>>>>> will have resource conflicts between Vulkan compute and OpenCL.
>>>>>
>>>>> I agree the partitioning isn't ideal. I'm hoping we can start with
>>>>> a solution that assumes that only pipe0 has any work and the other
>>>>> pipes are idle (no HSA/ROCm running on the system).
>>>>>
>>>>> This should be more or less the use case we expect from VR users.
>>>>>
>>>>> I agree the split is currently not ideal, but I'd like to consider
>>>>> that a separate task, because making it dynamic is not
>>>>> straightforward :P
>>>>>
>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>> will not be involved. I would assume that in the case of VR we
>>>>>> will have one main application ("console" mode(?)) so we could
>>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>>
>>>>> Correct, this is why we want to enable the high priority compute
>>>>> queue through libdrm-amdgpu, so that we can expose it through
>>>>> Vulkan later (a rough sketch of what that could look like is
>>>>> further down).
>>>>>
>>>>> For current VR workloads we have 3 separate processes running
>>>>> actually:
>>>>> 1) Game process
>>>>> 2) VR Compositor (this is the process that will require the high
>>>>> priority queue)
>>>>> 3) System compositor (we are looking at approaches to remove this
>>>>> overhead)
>>>>>
>>>>> For now I think it is okay to assume no OpenCL/ROCm is running
>>>>> simultaneously, but I would also like to be able to address this
>>>>> case in the future (cross-pipe priorities).
>>>>>
>>>>>> [Serguei] The problem with pre-emption of a graphics task: (a) it
>>>>>> may take time, so latency may suffer
>>>>>
>>>>> The latency is our main concern; we want something that is
>>>>> predictable. A good illustration of what the reprojection
>>>>> scheduling looks like can be found here:
>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>
>>>>>> (b) to preempt we need to have a different "context" - we want to
>>>>>> guarantee that submissions from the same context will be executed
>>>>>> in order.
>>>>>
>>>>> This is okay, as the reprojection work doesn't have dependencies
>>>>> on the game context, and it even happens in a separate process.
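>>>>>
>>>>> To sketch the libdrm-amdgpu idea mentioned above (every name below
>>>>> is a placeholder for illustration, not a settled interface):
>>>>>
>>>>>     /* Hypothetical libdrm-amdgpu entry point: a variant of the
>>>>>      * existing amdgpu_cs_ctx_create() that forwards a priority
>>>>>      * flag to the kernel through drm_amdgpu_ctx_in.flags. */
>>>>>     #include <stdint.h>
>>>>>     #include <amdgpu.h>
>>>>>
>>>>>     #define AMDGPU_CTX_PRIORITY_HIGH (1 << 0)  /* placeholder */
>>>>>
>>>>>     extern int amdgpu_cs_ctx_create2(amdgpu_device_handle dev,
>>>>>                                      uint32_t flags,
>>>>>                                      amdgpu_context_handle *ctx);
>>>>>
>>>>>     /* The VR compositor would then create the context that all
>>>>>      * of its reprojection submissions go through: */
>>>>>     static int create_compositor_ctx(amdgpu_device_handle dev,
>>>>>                                      amdgpu_context_handle *ctx)
>>>>>     {
>>>>>         return amdgpu_cs_ctx_create2(dev,
>>>>>                                      AMDGPU_CTX_PRIORITY_HIGH,
>>>>>                                      ctx);
>>>>>     }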
>>>>>
>>>>>> BTW: (a) Do you want to "preempt" and later resume, or do you
>>>>>> want to "preempt" and "cancel/abort"?
>>>>>
>>>>> Preempt the game with the compositor task and then resume it.
>>>>>
>>>>>> (b) Vulkan is a generic API and could be used for graphics as
>>>>>> well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>
>>>>> Yeah, the plan is to use Vulkan compute. But if you figure out a
>>>>> way for us to get a guaranteed execution time using Vulkan
>>>>> graphics, then I'll take you out for a beer :)
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>> ________________________________________
>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hi Andres,
>>>>>
>>>>> Please see inline (as [Serguei])
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>>
>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>> Sent: December 16, 2016 8:29 PM
>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hi Serguei,
>>>>>
>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> ________________________________________
>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Andres,
>>>>>
>>>>> Quick comments:
>>>>>
>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>> assignments/binding to the high-priority queue while it is in use,
>>>>> and "free" them later (we do not want to take CUs away from e.g. a
>>>>> graphics task forever and degrade graphics performance).
>>>>>
>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>> low-priority compute) takes all (extra) CUs and high-priority work
>>>>> waits for the needed resources. It will not be visible with "NOP"
>>>>> packets but only when you submit a "real" compute task, so I would
>>>>> recommend not using "NOP" packets at all for testing.
>>>>>
>>>>> It (CU assignment) could be done relatively easily when everything
>>>>> is going via the kernel (e.g. as part of frame submission), but I
>>>>> must admit that I am not sure about the best way for user level
>>>>> submissions (amdkfd).
>>>>>
>>>>> [AR] I wasn't aware of this part of the programming sequence.
>>>>> Thanks for the heads up! Is this similar to the CU masking
>>>>> programming?
>>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler",
>>>>> when deciding which queue to run, will check whether there are
>>>>> enough resources, and if not, it will begin to check other queues
>>>>> with lower priority.
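>>>>>
>>>>> (For reference, a sketch of the CU masking being discussed here:
>>>>> the per-queue CU enable masks live in the MQD's
>>>>> compute_static_thread_mgmt_se* fields; the helper below and any
>>>>> reservation policy built on top of it are illustrative only, not
>>>>> existing code:)
>>>>>
>>>>>     /* Illustrative sketch: program the per-shader-engine CU
>>>>>      * enable masks of a compute queue's MQD (one bit per CU,
>>>>>      * struct vi_mqd from vi_structs.h).  "Forcing" CUs onto the
>>>>>      * high-priority queue would mean clearing those bits from
>>>>>      * every other queue's mask while it has work, and restoring
>>>>>      * them afterwards. */
>>>>>     static void mqd_set_cu_masks(struct vi_mqd *mqd, u32 se0_mask,
>>>>>                                  u32 se1_mask, u32 se2_mask,
>>>>>                                  u32 se3_mask)
>>>>>     {
>>>>>         mqd->compute_static_thread_mgmt_se0 = se0_mask;
>>>>>         mqd->compute_static_thread_mgmt_se1 = se1_mask;
>>>>>         mqd->compute_static_thread_mgmt_se2 = se2_mask;
>>>>>         mqd->compute_static_thread_mgmt_se3 = se3_mask;
>>>>>     }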
>>>>>
>>>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>>>> queue and having nothing there except it.
>>>>>
>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as
>>>>> opposed to the MEC definition of pipe, which is a grouping of
>>>>> queues). I say this because amdgpu only has access to 1 pipe, and
>>>>> the rest are statically partitioned for amdkfd usage.
>>>>>
>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>> understand (by simplifying), some scheduling is per pipe. I know
>>>>> about the current allocation scheme, but I do not think that it is
>>>>> ideal. I would assume that we need to switch to dynamic
>>>>> partitioning of resources based on the workload, otherwise we will
>>>>> have resource conflicts between Vulkan compute and OpenCL.
>>>>>
>>>>> BTW: Which user level API do you want to use for compute: Vulkan
>>>>> or OpenCL?
>>>>>
>>>>> [AR] Vulkan
>>>>>
>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>> will not be involved. I would assume that in the case of VR we
>>>>> will have one main application ("console" mode(?)) so we could
>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>>
>>>>>> we will not be able to provide a solution compatible with GFX
>>>>>> workloads.
>>>>> I assume that you are talking about graphics? Am I right?
>>>>>
>>>>> [AR] Yeah, my understanding is that pre-empting the currently
>>>>> running graphics job and scheduling in something else using
>>>>> mid-buffer pre-emption has some cases where it doesn't work well.
>>>>> But if it starts working well with Polaris10, it might be a better
>>>>> solution for us (because the whole reprojection work uses the
>>>>> Vulkan graphics stack at the moment, and porting it to compute is
>>>>> not trivial).
>>>>>
>>>>> [Serguei] The problem with pre-emption of a graphics task: (a) it
>>>>> may take time, so latency may suffer; (b) to preempt we need to
>>>>> have a different "context" - we want to guarantee that submissions
>>>>> from the same context will be executed in order.
>>>>> BTW: (a) Do you want to "preempt" and later resume, or do you want
>>>>> to "preempt" and "cancel/abort"? (b) Vulkan is a generic API and
>>>>> could be used for graphics as well as for plain compute tasks
>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>>
>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf
>>>>> of Andres Rodriguez <andresr at valvesoftware.com>
>>>>> Sent: December 16, 2016 6:15 PM
>>>>> To: amd-gfx at lists.freedesktop.org
>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> This RFC is also available as a gist here:
>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>
>>>>> We are interested in feedback for a mechanism to effectively
>>>>> schedule high priority VR reprojection tasks (also referred to as
>>>>> time-warping) for Polaris10 running on the amdgpu kernel driver.
>>>>>
>>>>> Brief context:
>>>>> --------------
>>>>>
>>>>> The main objective of reprojection is to avoid motion sickness for
>>>>> VR users in scenarios where the game or application would fail to
>>>>> finish rendering a new frame in time for the next VBLANK. When
>>>>> this happens, the user's head movements are not reflected on the
>>>>> Head Mounted Display (HMD) for the duration of an extra frame.
>>>>> This extended mismatch between the inner ear and the eyes may
>>>>> cause the user to experience motion sickness.
>>>>>
>>>>> The VR compositor deals with this problem by fabricating a new
>>>>> frame using the user's updated head position in combination with
>>>>> the previous frames. This avoids a prolonged mismatch between the
>>>>> HMD output and the inner ear.
>>>>>
>>>>> Because of the adverse effects on the user, we require high
>>>>> confidence that the reprojection task will complete before the
>>>>> VBLANK interval, even if the GFX pipe is currently full of work
>>>>> from the game/application (which is most likely the case).
>>>>>
>>>>> For more details and illustrations, please refer to the following
>>>>> document:
>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>
>>>>> Requirements:
>>>>> -------------
>>>>>
>>>>> The mechanism must expose the following functionality:
>>>>>
>>>>>   * Job round trip time must be predictable, from submission to
>>>>>     fence signal
>>>>>
>>>>>   * The mechanism must support compute workloads.
>>>>>
>>>>> Goals:
>>>>> ------
>>>>>
>>>>>   * The mechanism should provide low submission latencies
>>>>>
>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>> hardware should be equivalent to submitting a NOP on idle hardware.
>>>>>
>>>>> Nice to have:
>>>>> -------------
>>>>>
>>>>>   * The mechanism should also support GFX workloads.
>>>>>
>>>>> My understanding is that with the current hardware capabilities in
>>>>> Polaris10 we will not be able to provide a solution compatible
>>>>> with GFX workloads.
>>>>>
>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>> approach or suggestion that will also be compatible with the GFX
>>>>> ring, please let us know about it.
>>>>>
>>>>>   * The above guarantees should also be respected by amdkfd
>>>>>     workloads
>>>>>
>>>>> Would be good to have for consistency, but not strictly necessary
>>>>> as users running games are not traditionally running HPC workloads
>>>>> in the background.
>>>>>
>>>>> Proposed approach:
>>>>> ------------------
>>>>>
>>>>> Similar to the Windows driver, we could expose a high priority
>>>>> compute queue to userspace.
>>>>>
>>>>> Submissions to this compute queue will be scheduled with high
>>>>> priority, and may acquire hardware resources previously in use by
>>>>> other queues.
>>>>>
>>>>> This can be achieved by taking advantage of the 'priority' field
>>>>> in the HQDs and could be programmed by amdgpu or the amdgpu
>>>>> scheduler. The relevant register fields are:
>>>>>   * mmCP_HQD_PIPE_PRIORITY
>>>>>   * mmCP_HQD_QUEUE_PRIORITY
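>>>>>
>>>>> (As a rough illustration of that programming on VI, assuming the
>>>>> usual srbm_select protocol used elsewhere in gfx_v8; the helper
>>>>> and the priority values are a sketch, not existing code:)
>>>>>
>>>>>     /* Illustrative sketch: raise the priority of a single HQD.
>>>>>      * Done under the SRBM mutex, since the srbm_select state is
>>>>>      * global to the device. */
>>>>>     static void set_hqd_priority(struct amdgpu_device *adev,
>>>>>                                  u32 mec, u32 pipe, u32 queue,
>>>>>                                  u32 pipe_prio, u32 queue_prio)
>>>>>     {
>>>>>         mutex_lock(&adev->srbm_mutex);
>>>>>         vi_srbm_select(adev, mec, pipe, queue, 0);
>>>>>
>>>>>         WREG32(mmCP_HQD_PIPE_PRIORITY, pipe_prio);
>>>>>         WREG32(mmCP_HQD_QUEUE_PRIORITY, queue_prio);
>>>>>
>>>>>         vi_srbm_select(adev, 0, 0, 0, 0);
>>>>>         mutex_unlock(&adev->srbm_mutex);
>>>>>     }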
>>>>>
>>>>> Implementation approach 1 - static partitioning:
>>>>> ------------------------------------------------
>>>>>
>>>>> The amdgpu driver currently controls 8 compute queues from pipe0.
>>>>> We can statically partition these as follows:
>>>>>   * 7x regular
>>>>>   * 1x high priority
>>>>>
>>>>> The relevant priorities can be set so that submissions to the high
>>>>> priority ring will starve the other compute rings and the GFX ring.
>>>>>
>>>>> The amdgpu scheduler will only place jobs into the high priority
>>>>> rings if the context is marked as high priority, and a
>>>>> corresponding priority level should be added to keep track of this
>>>>> information:
>>>>>   * AMD_SCHED_PRIORITY_KERNEL
>>>>>   * -> AMD_SCHED_PRIORITY_HIGH
>>>>>   * AMD_SCHED_PRIORITY_NORMAL
>>>>>
>>>>> The user will request a high priority context by setting an
>>>>> appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or
>>>>> similar):
>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>
>>>>> The setting is at a per-context level so that we can:
>>>>>   * Maintain a consistent FIFO ordering of all submissions to a
>>>>>     context
>>>>>   * Create high priority and non-high priority contexts in the
>>>>>     same process
>>>>>
>>>>> Implementation approach 2 - dynamic priority programming:
>>>>> ---------------------------------------------------------
>>>>>
>>>>> Similar to the above, but instead of programming the priorities at
>>>>> amdgpu_init() time, the SW scheduler will reprogram the queue
>>>>> priorities dynamically when scheduling a task.
>>>>>
>>>>> This would involve having a hardware specific callback from the
>>>>> scheduler to set the appropriate queue priority:
>>>>> set_priority(int ring, int index, int priority)
>>>>>
>>>>> During this callback we would have to grab the SRBM mutex to
>>>>> perform the appropriate HW programming, and I'm not really sure if
>>>>> that is something we should be doing from the scheduler.
>>>>>
>>>>> On the positive side, this approach would allow us to program a
>>>>> range of priorities for jobs instead of a single "high priority"
>>>>> value, achieving something similar to the niceness API available
>>>>> for CPU scheduling.
>>>>>
>>>>> I'm not sure if this flexibility is something that we would need
>>>>> for our use case, but it might be useful in other scenarios
>>>>> (multiple users sharing compute time on a server).
>>>>>
>>>>> This approach would require a new int field in drm_amdgpu_ctx_in,
>>>>> or repurposing of the flags field.
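>>>>>
>>>>> (To make approach 2 concrete, the hook could be wired up roughly
>>>>> as follows; where it hangs off - ring functions vs. scheduler
>>>>> backend ops - and the SRBM locking question above are still open,
>>>>> so treat this purely as a sketch:)
>>>>>
>>>>>     /* Illustrative sketch of the HW-specific callback from
>>>>>      * approach 2.  The signature matches the set_priority()
>>>>>      * proposed above; the ops struct name is a placeholder. */
>>>>>     struct amdgpu_priority_ops {
>>>>>         void (*set_priority)(int ring, int index, int priority);
>>>>>     };
>>>>>
>>>>>     /* Called by the SW scheduler before handing a job to the HW
>>>>>      * ring; only touch the registers when the priority actually
>>>>>      * changes, to keep SRBM traffic off the common path. */
>>>>>     static void
>>>>>     sched_apply_priority(const struct amdgpu_priority_ops *ops,
>>>>>                          int ring, int index,
>>>>>                          int ctx_prio, int *cur_prio)
>>>>>     {
>>>>>         if (*cur_prio != ctx_prio) {
>>>>>             ops->set_priority(ring, index, ctx_prio);
>>>>>             *cur_prio = ctx_prio;
>>>>>         }
>>>>>     }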
>>>>>
>>>>> Known current obstacles:
>>>>> ------------------------
>>>>>
>>>>> The SQ is currently programmed to disregard the HQD priorities,
>>>>> and instead it picks jobs at random. Settings from the shader
>>>>> itself are also disregarded as this is considered a privileged
>>>>> field.
>>>>>
>>>>> Effectively we can get our compute wavefront launched ASAP, but we
>>>>> might not get the time we need on the SQ.
>>>>>
>>>>> The current programming would have to be changed to allow priority
>>>>> propagation from the HQD into the SQ.
>>>>>
>>>>> Generic approach for all HW IPs:
>>>>> --------------------------------
>>>>>
>>>>> For consistency purposes, the high priority context can be enabled
>>>>> for all HW IPs with support of the SW scheduler. This will
>>>>> function similarly to the current AMD_SCHED_PRIORITY_KERNEL
>>>>> priority, where the job can jump ahead of anything not committed
>>>>> to the HW queue.
>>>>>
>>>>> The benefits of requesting a high priority context for a
>>>>> non-compute queue will be lesser (e.g. up to 10s of wait time if a
>>>>> GFX command is stuck in front of you), but having the API in place
>>>>> will allow us to easily improve the implementation in the future
>>>>> as new features become available in new hardware.
>>>>>
>>>>> Future steps:
>>>>> -------------
>>>>>
>>>>> Once we have an approach settled, I can take care of the
>>>>> implementation.
>>>>>
>>>>> Also, once the interface is mostly decided, we can start thinking
>>>>> about exposing the high priority queue through radv.
>>>>>
>>>>> Request for feedback:
>>>>> ---------------------
>>>>>
>>>>> We aren't married to any of the approaches outlined above. Our
>>>>> goal is to obtain a mechanism that will allow us to complete the
>>>>> reprojection job within a predictable amount of time. So if anyone
>>>>> has any suggestions for improvements or alternative strategies, we
>>>>> are more than happy to hear them.
>>>>>
>>>>> If any of the technical information above is also incorrect, feel
>>>>> free to point out my misunderstandings.
>>>>>
>>>>> Looking forward to hearing from you.
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx