Indeed a couple of nice numbers.

> but everything already committed
> to the HW queue is executed in strict FIFO order.

Well actually if we get a high priority submission we could preempt/abort everything on the ring buffer before it in theory. Probably not as fine a granularity as the hardware scheduler, but might be easier to get working.

Regards,
Christian.

On 26.12.2016 03:26, zhoucm1 wrote:
> Nice experiment, which is exactly what the SW scheduler can provide.
> And as you said "I.e. your context can be scheduled into the HW queue ahead of any other context, but everything already committed to the HW queue is executed in strict FIFO order."
>
> If you want to keep consistent latency, you will need to enable the hw priority queue feature.
>
> Regards,
> David Zhou
>
> On 2016-12-24 06:20, Andres Rodriguez wrote:
>> Hey John,
>>
>> I've collected a bit of data using high priority SW scheduler queues, thought you might be interested.
>>
>> Implementation as per the patch above.
>>
>> Control test 1
>> ==============
>>
>> Sascha Willems mesh sample running on its own at regular priority
>>
>> Results
>> -------
>>
>> Mesh: ~0.14ms per-frame latency
>>
>> Control test 2
>> ==============
>>
>> Two Sascha Willems mesh samples running at regular priority
>>
>> Results
>> -------
>>
>> Mesh 1: ~0.26ms per-frame latency
>> Mesh 2: ~0.26ms per-frame latency
>>
>> Test 1
>> ======
>>
>> Two Sascha Willems mesh samples running simultaneously. One at high priority and the other running in a regular priority graphics context.
>>
>> Results
>> -------
>>
>> Mesh High: 0.14 - 0.24ms per-frame latency
>> Mesh Regular: 0.24 - 0.40ms per-frame latency
>>
>> Test 2
>> ======
>>
>> Ten Sascha Willems mesh samples running simultaneously. One at high priority and the others running in a regular priority graphics context.
>>
>> Results
>> -------
>>
>> Mesh High: 0.14 - 0.8ms per-frame latency
>> Mesh Regular: 1.10 - 2.05ms per-frame latency
>>
>> Test 3
>> ======
>>
>> Two Sascha Willems mesh samples running simultaneously. One at high priority and the other running in a regular priority graphics context.
>>
>> Also running Unigine Heaven at Extreme preset @ 2560x1600
>>
>> Results
>> -------
>>
>> Mesh High: 7 - 100ms per-frame latency (lots of fluctuation)
>> Mesh Regular: 40 - 130ms per-frame latency (lots of fluctuation)
>> Unigine Heaven: 20-40 fps
>>
>> Test 4
>> ======
>>
>> Two Sascha Willems mesh samples running simultaneously. One at high priority and the other running in a regular priority graphics context.
>>
>> Also running Talos Principle @ 4K
>>
>> Results
>> -------
>>
>> Mesh High: 0.14 - 3.97ms per-frame latency (mostly floats around ~0.4ms)
>> Mesh Regular: 0.43 - 8.11ms per-frame latency (lots of fluctuation)
>> Talos: 24.8 fps AVG
>>
>> Observations
>> ============
>>
>> The high priority queue based on the SW scheduler provides significant gains when paired with tasks that submit short duration commands into the queue. This can be observed in tests 1 and 2.
>>
>> When the pipe is full of long running commands, the effects are dampened. As observed in test 3, the per-frame latency suffers very large spikes, and the latencies are very inconsistent.
>>
>> Talos seems to be a better behaved game. It may be submitting shorter draw commands and the SW scheduler is able to interleave the rest of the work.
>>
>> The results seem consistent with the hypothetical advantages the SW scheduler should provide. I.e. your context can be scheduled into the HW queue ahead of any other context, but everything already committed to the HW queue is executed in strict FIFO order.
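The behaviour described above can be pictured with a small, purely illustrative sketch; this is not the actual amdgpu scheduler code and the names are made up. Jobs are drained from per-priority run queues, so a high priority context jumps ahead of anything not yet written to the HW ring, while the ring itself stays strict FIFO:

    /* Illustrative sketch only -- not the actual amdgpu scheduler code. */
    enum sched_priority { PRIO_KERNEL = 0, PRIO_HIGH, PRIO_NORMAL, PRIO_COUNT };

    struct sched_job;

    struct run_queue {
        /* Returns the oldest queued job for this priority, or NULL. */
        struct sched_job *(*pop_job)(struct run_queue *rq);
    };

    static struct sched_job *select_next_job(struct run_queue rq[PRIO_COUNT])
    {
        /* Highest-priority run queue with pending work wins ... */
        for (int p = 0; p < PRIO_COUNT; p++) {
            struct sched_job *job = rq[p].pop_job(&rq[p]);
            if (job)
                return job; /* ... but once pushed to the HW ring it is FIFO. */
        }
        return NULL;
    }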
>> In order to deal with cases similar to Test 3, we will need to take advantage of further features.
>>
>> Notes
>> =====
>>
>> - Tests were run multiple times, and reboots were performed during tests.
>> - The mesh sample isn't really designed for benchmarking, but it should be decent for ballpark figures.
>> - The high priority mesh app was run with default niceness and also with niceness at -20. This had no effect on the results, so it was not added above.
>> - CPU usage was not saturated while running the tests.
>>
>> Regards,
>> Andres
>>
>> On Fri, Dec 23, 2016 at 1:18 PM, Pierre-Loup A. Griffais <pgriffais at valvesoftware.com> wrote:
>>
>> I hate to keep bringing up display topics in an unrelated conversation, but I'm not sure where you got "Application -> X server -> compositor -> X server" from. As I was saying before, we need to be presenting directly to the HMD display as no display server can be in the way, both for latency but also quality of service reasons (a buggy application cannot be allowed to accidentally display undistorted rendering into the HMD); we intend to do the necessary work for this, and the extent of X's (or a Wayland implementation, or any other display server) involvement will be to participate enough to know that the HMD display is off-limits. If you have more questions on the display aspect, or VR rendering in general, I'm happy to try to address them out-of-band from this conversation.
>>
>> On 12/23/2016 02:54 AM, Christian König wrote:
>>
>>> But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.
>>
>> Assuming that the HMD is attached to the rendering device in some way, you have the X server and the Compositor which both try to be DRM master at the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentions below that Dave is working on?
>>
>> Additionally, a compositor in combination with X is a bit counterproductive when you want to keep the latency low.
>>
>> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data to be displayed is from the Application -> X server -> compositor -> X server.
>>
>> The extra step between X server and compositor just means extra latency and for this use case you probably don't want that.
>>
>> Targeting something like Wayland, and when you need X compatibility XWayland, sounds like the much better idea.
>>
>> Regards,
>> Christian.
>>
>> On 22.12.2016 20:54, Pierre-Loup A. Griffais wrote:
>>
>> Display concerns are a separate issue, and as Andres said we have other plans to address them. But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server. Same with security, we can have a separate conversation about that when the time comes.
>>
>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>
>> Andres,
>>
>> Did you measure latency, etc. impact of __any__ compositor?
>>
>> My understanding is that VR has pretty strict requirements related to QoS.
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>
>> Hey Christian,
>>
>> We are currently interested in X, but with some distros switching to other compositors by default, we also need to consider those.
>>
>> We agree, running the full vrcompositor as root isn't something that we want to do. Too many security concerns. Having a small root helper that does the privilege escalation for us is the initial idea.
>>
>> For a long term approach, Pierre-Loup and Dave are working on dealing with the "two compositors" scenario a little better in DRM+X. Fullscreen isn't really a sufficient approach, since we don't want the HMD to be used as part of the Desktop environment when a VR app is not in use (this is extremely annoying).
>>
>> When the above is settled, we should have an auth mechanism besides DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the HMD permanently away from X. Re-using that auth method to gate this IOCTL is probably going to be the final solution.
>>
>> I propose to start with ROOT_ONLY since it should allow us to respect kernel IOCTL compatibility guidelines with the most flexibility. Going from a restrictive to a more flexible permission model would be inclusive, but going from a general to a restrictive model may exclude some apps that used to work.
>>
>> Regards,
>> Andres
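As a rough illustration of the ROOT_ONLY gating discussed above (a sketch only, not a patch): AMDGPU_CTX_FLAG_HIGH_PRIORITY is a placeholder name for the proposed flag, and CAP_SYS_ADMIN is just one possible stand-in for "root only".

    #include <linux/capability.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    /* Hypothetical flag name and value; the real name/value are still TBD. */
    #define AMDGPU_CTX_FLAG_HIGH_PRIORITY	(1u << 0)

    static int amdgpu_ctx_flags_permitted(u32 flags)
    {
        if (!(flags & AMDGPU_CTX_FLAG_HIGH_PRIORITY))
            return 0;

        /* Unprivileged processes may not create high priority contexts. */
        if (!capable(CAP_SYS_ADMIN))
            return -EPERM;

        return 0;
    }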
>> On 12/22/2016 6:42 AM, Christian König wrote:
>>
>> Hi Andres,
>>
>> well using root might cause stability and security problems as well. We worked quite hard to avoid exactly this for X.
>>
>> We could make this feature depend on the compositor being DRM master, but for example with X the X server is master (and e.g. can change resolutions etc.) and not the compositor.
>>
>> So another question is also what windowing system (if any) are you planning to use? X, Wayland, Flinger or something completely different?
>>
>> Regards,
>> Christian.
>>
>> On 20.12.2016 16:51, Andres Rodriguez wrote:
>>
>> Hi Christian,
>>
>> That is definitely a concern. What we are currently thinking is to make the high priority queues accessible to root only.
>>
>> Therefore, if a non-root user attempts to set the high priority flag on context allocation, we would fail the call and return EPERM.
>>
>> Regards,
>> Andres
>>
>> On 12/20/2016 7:56 AM, Christian König wrote:
>>
>>> BTW: If there is a non-VR application which will use the high-priority h/w queue then the VR application will suffer. Any ideas how to solve it?
>>
>> Yeah, that problem came to my mind as well.
>>
>> Basically we need to restrict those high priority submissions to the VR compositor, or otherwise any malfunctioning application could use it.
>>
>> Just think about some WebGL suddenly taking all our rendering away and we won't get anything drawn any more.
>>
>> Alex or Michel, any ideas on that?
>>
>> Regards,
>> Christian.
>>
>> On 19.12.2016 15:48, Serguei Sagalovitch wrote:
>>
>>> If compute queue is occupied only by you, the efficiency
>>> is equal with setting job queue to high priority I think.
>>
>> The only risk is the situation when graphics will take all needed CUs. But in any case it should be a very good test.
>>
>> Andres/Pierre-Loup,
>>
>> Did you try to do it, or is it a lot of work for you?
>>
>> BTW: If there is a non-VR application which will use the high-priority h/w queue then the VR application will suffer. Any ideas how to solve it?
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>
>> Do you encounter the priority issue for the compute queue with the current driver?
>>
>> If the compute queue is occupied only by you, the efficiency is equal with setting the job queue to high priority I think.
>>
>> Regards,
>> David Zhou
>>
>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>
>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>
>> I'm not sure if I'm asking for too much, but if we can coordinate a similar interface in radv and amdgpu-pro at the vulkan level that would be great.
>>
>> I'm not sure what that's going to be yet.
>>
>> - Andres
>>
>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>
>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>
>>> We're currently working with the open stack; I assume that a mechanism could be exposed by both open and Pro Vulkan userspace drivers and that the amdgpu kernel interface improvements we would pursue following this discussion would let both drivers take advantage of the feature, correct?
>>
>> Of course.
>> Does the open stack have Vulkan support?
>>
>> Regards,
>> David Zhou
>>
>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>
>> By the way, are you using the all-open driver or the amdgpu-pro driver?
>>
>> +David Mao, who is working on our Vulkan driver.
>>
>> Regards,
>> David Zhou
>>
>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>
>> Hi Serguei,
>>
>> I'm also working on bringing up our VR runtime on top of amdgpu; see replies inline.
>>
>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>
>>> Andres,
>>>
>>>> For current VR workloads we have 3 separate processes running actually:
>>>
>>> So we could have a potential memory overcommit case, or do you do partitioning on your own? I would think that there is a need to avoid overcommit in the VR case to prevent any BO migration.
>>
>> You're entirely correct; currently the VR runtime is setting up prioritized CPU scheduling for its VR compositor, we're working on prioritized GPU scheduling and pre-emption (e.g. this thread), and in the future it will make sense to do work in order to make sure that its memory allocations do not get evicted, to prevent any unwelcome additional latency in the event of needing to perform just-in-time reprojection.
>>
>>> BTW: Do you mean __real__ processes or threads? Based on my understanding sharing BOs between different processes could introduce additional synchronization constraints. btw: I am not sure if we are able to share Vulkan sync. objects across the process boundary.
>>
>> They are different processes; it is important for the compositor that is responsible for quality-of-service features such as consistently presenting distorted frames with the right latency, reprojection, etc., to be separate from the main application.
>>
>> Currently we are using unreleased cross-process memory and semaphore extensions to fetch updated eye images from the client application, but the just-in-time reprojection discussed here does not actually have any direct interactions with cross-process resource sharing, since it's achieved by using whatever are the latest, most up-to-date eye images that have already been sent by the client application, which are already available to use without additional synchronization.
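For readers curious what such cross-process semaphore sharing can look like, here is a heavily simplified sketch using the VK_KHR_external_semaphore_fd extension. This is an assumption for illustration only and is not necessarily the unreleased extension referenced above; passing the fd between processes (e.g. over a UNIX socket) is omitted.

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Export a semaphore as an opaque fd in the client process.  The extension
     * entry points must be fetched with vkGetDeviceProcAddr in real code. */
    static int export_semaphore_fd(VkDevice dev, VkSemaphore sem,
                                   PFN_vkGetSemaphoreFdKHR get_fd)
    {
        VkSemaphoreGetFdInfoKHR info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
            .semaphore = sem,
            .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT_KHR,
        };
        int fd = -1;

        return get_fd(dev, &info, &fd) == VK_SUCCESS ? fd : -1;
    }

    /* Import the received fd into the compositor's own semaphore. */
    static VkResult import_semaphore_fd(VkDevice dev, VkSemaphore sem, int fd,
                                        PFN_vkImportSemaphoreFdKHR import_fd)
    {
        VkImportSemaphoreFdInfoKHR info = {
            .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
            .semaphore = sem,
            .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT_KHR,
            .fd = fd,
        };

        return import_fd(dev, &info);
    }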
>>>> 3) System compositor (we are looking at approaches to remove this overhead)
>>>
>>> Yes, IMHO the best is to run in "full screen mode".
>>
>> Yes, we are working on mechanisms to present directly to the headset display without any intermediaries as a separate effort.
>>
>>>> The latency is our main concern,
>>>
>>> I would assume that this is the known problem (at least for compute usage). It looks like amdgpu kernel submission is rather CPU intensive (at least in the default configuration).
>>
>> As long as it's a consistent cost, it shouldn't be an issue. However, if there are high degrees of variance then that would be troublesome and we would need to account for the worst case.
>>
>> Hopefully the requirements and approach we described make sense, we're looking forward to your feedback and suggestions.
>>
>> Thanks!
>>  - Pierre-Loup
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> From: Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 10:00 PM
>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hey Serguei,
>>
>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I understand (by simplifying) some scheduling is per pipe. I know about the current allocation scheme but I do not think that it is ideal. I would assume that we need to switch to dynamic partitioning of resources based on the workload, otherwise we will have a resource conflict between Vulkan compute and OpenCL.
>>
>> I agree the partitioning isn't ideal. I'm hoping we can start with a solution that assumes that only pipe0 has any work and the other pipes are idle (no HSA/ROCm running on the system).
>> This should be more or less the use case we expect from VR users.
>>
>> I agree the split is currently not ideal, but I'd like to consider that a separate task, because making it dynamic is not straightforward :P
>>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be involved. I would assume that in the case of VR we will have one main application ("console" mode(?)) so we could temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>
>> Correct, this is why we want to enable the high priority compute queue through libdrm-amdgpu, so that we can expose it through Vulkan later.
>>
>> For current VR workloads we have 3 separate processes running actually:
>> 1) Game process
>> 2) VR Compositor (this is the process that will require the high priority queue)
>> 3) System compositor (we are looking at approaches to remove this overhead)
>>
>> For now I think it is okay to assume no OpenCL/ROCm running simultaneously, but I would also like to be able to address this case in the future (cross-pipe priorities).
>>
>> [Serguei] The problem with pre-emption of a graphics task: (a) it may take time so latency may suffer
>>
>> The latency is our main concern, we want something that is predictable. A good illustration of what the reprojection scheduling looks like can be found here:
>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>
>> [Serguei] (b) to preempt we need to have a different "context" - we want to guarantee that submissions from the same context will be executed in order.
>>
>> This is okay, as the reprojection work doesn't have dependencies on the game context, and it even happens in a separate process.
>>
>> [Serguei] BTW: (a) Do you want "preempt" and later resume, or do you want "preempt" and "cancel/abort"?
>>
>> Preempt the game with the compositor task and then resume it.
>>
>> [Serguei] (b) Vulkan is a generic API and could be used for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>> Yeah, the plan is to use vulkan compute. But if you figure out a way for us to get a guaranteed execution time using vulkan graphics, then I'll take you out for a beer :)
>>
>> Regards,
>> Andres
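As a side note for anyone following along, picking the compute path in Vulkan boils down to choosing a queue family that advertises VK_QUEUE_COMPUTE_BIT. A minimal, purely illustrative sketch, preferring a family without VK_QUEUE_GRAPHICS_BIT so the work can land on a compute ring rather than the GFX ring:

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Returns a compute-capable queue family index, preferring a dedicated
     * compute family; UINT32_MAX if none was found. */
    static uint32_t find_compute_queue_family(VkPhysicalDevice phys)
    {
        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, NULL);

        VkQueueFamilyProperties props[16];
        if (count > 16)
            count = 16;
        vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, props);

        uint32_t fallback = UINT32_MAX;
        for (uint32_t i = 0; i < count; ++i) {
            if (!(props[i].queueFlags & VK_QUEUE_COMPUTE_BIT))
                continue;
            if (!(props[i].queueFlags & VK_QUEUE_GRAPHICS_BIT))
                return i;       /* dedicated compute family */
            fallback = i;       /* graphics+compute family */
        }
        return fallback;
    }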
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>> Sent: Friday, December 16, 2016 9:13 PM
>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Andres,
>>
>> Please see inline (as [Serguei])
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> From: Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 8:29 PM
>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Serguei,
>>
>> Thanks for the feedback. Answers inline as [AR].
>>
>> Regards,
>> Andres
>>
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>> Sent: Friday, December 16, 2016 8:15 PM
>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Andres,
>>
>> Quick comments:
>>
>> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding to the high-priority queue when it will be in use and "free" them later (we do not want to take CUs away from e.g. a graphics task forever and degrade graphics performance).
>>
>> Otherwise we could have a scenario where a long graphics task (or low-priority compute) takes all (extra) CUs and high priority will wait for the needed resources. It will not be visible on "NOP" but only when you submit a "real" compute task, so I would recommend not to use "NOP" packets at all for testing.
>>
>> It (CU assignment) could be done relatively easily when everything is going via the kernel (e.g. as part of frame submission) but I must admit that I am not sure about the best way for user level submissions (amdkfd).
>>
>> [AR] I wasn't aware of this part of the programming sequence. Thanks for the heads up! Is this similar to the CU masking programming?
>>
>> [Serguei] Yes. To simplify: the problem is that the "scheduler", when deciding which queue to run, will check if there are enough resources, and if not then it will begin to check other queues with lower priority.
>>
>> 2) I would recommend dedicating the whole pipe to the high-priority queue and having nothing there except it.
>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed to the MEC definition of pipe, which is a grouping of queues). I say this because amdgpu only has access to 1 pipe, and the rest are statically partitioned for amdkfd usage.
>>
>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I understand (by simplifying) some scheduling is per pipe. I know about the current allocation scheme but I do not think that it is ideal. I would assume that we need to switch to dynamic partitioning of resources based on the workload, otherwise we will have a resource conflict between Vulkan compute and OpenCL.
>>
>> BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?
>>
>> [AR] Vulkan
>>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be involved. I would assume that in the case of VR we will have one main application ("console" mode(?)) so we could temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>
>>> we will not be able to provide a solution compatible with GFX workloads.
>>
>> I assume that you are talking about graphics? Am I right?
>>
>> [AR] Yeah, my understanding is that pre-empting the currently running graphics job and scheduling in something else using mid-buffer pre-emption has some cases where it doesn't work well. But if with polaris10 it starts working well, it might be a better solution for us (because the whole reprojection work uses the vulkan graphics stack at the moment, and porting it to compute is not trivial).
>>
>> [Serguei] The problem with pre-emption of a graphics task: (a) it may take time so latency may suffer (b) to preempt we need to have a different "context" - we want to guarantee that submissions from the same context will be executed in order. BTW: (a) Do you want "preempt" and later resume, or do you want "preempt" and "cancel/abort"? (b) Vulkan is a generic API and could be used for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>> Sincerely yours,
>> Serguei Sagalovitch
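To make the pipe vs. queue distinction above concrete, here is a small illustrative sketch of the compute front-end topology as I understand it for VI/Polaris-class parts; the MEC/pipe/queue counts are assumptions, not taken from this thread, apart from amdgpu's static claim of pipe 0 mentioned above:

    /*
     * Rough, assumption-based sketch: each MEC exposes pipes, each pipe
     * exposes 8 queues (HQDs).  amdgpu currently claims the queues of
     * MEC0/pipe0; the remaining pipes are left to amdkfd.
     */
    #define MEC_NUM_PIPES    4
    #define PIPE_NUM_QUEUES  8

    struct compute_queue_id {
        unsigned int mec;    /* micro engine compute block */
        unsigned int pipe;   /* pipe within the MEC */
        unsigned int queue;  /* HQD slot within the pipe */
    };

    static int queue_owned_by_amdgpu(const struct compute_queue_id *q)
    {
        /* Static split discussed in the thread: amdgpu owns MEC0/pipe0 only. */
        return q->mec == 0 && q->pipe == 0 && q->queue < PIPE_NUM_QUEUES;
    }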
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 6:15 PM
>> To: amd-gfx at lists.freedesktop.org
>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Everyone,
>>
>> This RFC is also available as a gist here:
>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>
>> We are interested in feedback for a mechanism to effectively schedule high priority VR reprojection tasks (also referred to as time-warping) for Polaris10 running on the amdgpu kernel driver.
>>
>> Brief context:
>> --------------
>>
>> The main objective of reprojection is to avoid motion sickness for VR users in scenarios where the game or application would fail to finish rendering a new frame in time for the next VBLANK. When this happens, the user's head movements are not reflected on the Head Mounted Display (HMD) for the duration of an extra frame. This extended mismatch between the inner ear and the eyes may cause the user to experience motion sickness.
>>
>> The VR compositor deals with this problem by fabricating a new frame using the user's updated head position in combination with the previous frames. This avoids a prolonged mismatch between the HMD output and the inner ear.
>>
>> Because of the adverse effects on the user, we require high confidence that the reprojection task will complete before the VBLANK interval. Even if the GFX pipe is currently full of work from the game/application (which is most likely the case).
>> For more details and illustrations, please refer to the following document ("Gaming: Asynchronous Shaders Evolved"):
>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>
>> Requirements:
>> -------------
>>
>> The mechanism must expose the following functionality:
>>
>> * Job round trip time must be predictable, from submission to fence signal
>>
>> * The mechanism must support compute workloads
>>
>> Goals:
>> ------
>>
>> * The mechanism should provide low submission latencies
>>
>> Test: submitting a NOP packet through the mechanism on busy hardware should be equivalent to submitting a NOP on idle hardware.
>>
>> Nice to have:
>> -------------
>>
>> * The mechanism should also support GFX workloads
>>
>> My understanding is that with the current hardware capabilities in Polaris10 we will not be able to provide a solution compatible with GFX workloads.
>>
>> But I would love to hear otherwise. So if anyone has an idea, approach or suggestion that will also be compatible with the GFX ring, please let us know about it.
>>
>> * The above guarantees should also be respected by amdkfd workloads
>>
>> Would be good to have for consistency, but not strictly necessary as users running games are not traditionally running HPC workloads in the background.
>>
>> Proposed approach:
>> ------------------
>>
>> Similar to the Windows driver, we could expose a high priority compute queue to userspace.
>>
>> Submissions to this compute queue will be scheduled with high priority, and may acquire hardware resources previously in use by other queues.
>>
>> This can be achieved by taking advantage of the 'priority' field in the HQDs and could be programmed by amdgpu or the amdgpu scheduler. The relevant register fields are:
>>
>> * mmCP_HQD_PIPE_PRIORITY
>> * mmCP_HQD_QUEUE_PRIORITY
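For illustration, programming those two fields for a single compute queue might look roughly like the sketch below. This is an assumption-laden sketch, not a patch: it assumes VI-style SRBM indexing via a vi_srbm_select()-like helper and the driver's srbm_mutex, and the priority value encoding is a placeholder.

    /* Sketch only: program the HQD priority fields named above for one queue. */
    static void set_compute_queue_priority(struct amdgpu_device *adev,
                                           u32 me, u32 pipe, u32 queue,
                                           u32 priority)
    {
        mutex_lock(&adev->srbm_mutex);
        vi_srbm_select(adev, me, pipe, queue, 0);

        WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
        WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);

        vi_srbm_select(adev, 0, 0, 0, 0);
        mutex_unlock(&adev->srbm_mutex);
    }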
>> Implementation approach 1 - static partitioning:
>> ------------------------------------------------
>>
>> The amdgpu driver currently controls 8 compute queues from pipe0. We can statically partition these as follows:
>>
>> * 7x regular
>> * 1x high priority
>>
>> The relevant priorities can be set so that submissions to the high priority ring will starve the other compute rings and the GFX ring.
>>
>> The amdgpu scheduler will only place jobs into the high priority rings if the context is marked as high priority. And a corresponding priority should be added to keep track of this information:
>>
>> * AMD_SCHED_PRIORITY_KERNEL
>> * -> AMD_SCHED_PRIORITY_HIGH
>> * AMD_SCHED_PRIORITY_NORMAL
>>
>> The user will request a high priority context by setting an appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>
>> The setting is at a per-context level so that we can:
>> * Maintain a consistent FIFO ordering of all submissions to a context
>> * Create high priority and non-high priority contexts in the same process
>>
>> Implementation approach 2 - dynamic priority programming:
>> ---------------------------------------------------------
>>
>> Similar to the above, but instead of programming the priorities at amdgpu_init() time, the SW scheduler will reprogram the queue priorities dynamically when scheduling a task.
>>
>> This would involve having a hardware specific callback from the scheduler to set the appropriate queue priority: set_priority(int ring, int index, int priority)
>>
>> During this callback we would have to grab the SRBM mutex to perform the appropriate HW programming, and I'm not really sure if that is something we should be doing from the scheduler.
>>
>> On the positive side, this approach would allow us to program a range of priorities for jobs instead of a single "high priority" value, achieving something similar to the niceness API available for CPU scheduling.
>>
>> I'm not sure if this flexibility is something that we would need for our use case, but it might be useful in other scenarios (multiple users sharing compute time on a server).
>>
>> This approach would require a new int field in drm_amdgpu_ctx_in, or repurposing of the flags field.
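For a sense of what approach 1 would look like from userspace, here is a hedged sketch of allocating a context with such a flag through the existing context ioctl. The AMDGPU_CTX_HIGH_PRIORITY define and its value are hypothetical (the flag does not exist in amdgpu_drm.h); the rest uses the existing libdrm drmCommandWriteRead() path.

    #include <stdint.h>
    #include <string.h>
    #include <xf86drm.h>
    #include <amdgpu_drm.h>

    /* Hypothetical flag proposed in the RFC; value is a placeholder. */
    #ifndef AMDGPU_CTX_HIGH_PRIORITY
    #define AMDGPU_CTX_HIGH_PRIORITY (1u << 0)
    #endif

    /* Illustrative sketch of the userspace side of "approach 1": allocate an
     * amdgpu context and ask for high priority via the context ioctl. */
    static int alloc_high_priority_ctx(int drm_fd, uint32_t *ctx_id)
    {
        union drm_amdgpu_ctx args;
        int r;

        memset(&args, 0, sizeof(args));
        args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
        args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;   /* proposed, not yet in the uapi */

        r = drmCommandWriteRead(drm_fd, DRM_AMDGPU_CTX, &args, sizeof(args));
        if (r)
            return r;   /* e.g. -EPERM for unprivileged callers under ROOT_ONLY */

        *ctx_id = args.out.alloc.ctx_id;
        return 0;
    }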
>> Known current obstacles:
>> ------------------------
>>
>> The SQ is currently programmed to disregard the HQD priorities, and instead it picks jobs at random. Settings from the shader itself are also disregarded as this is considered a privileged field.
>>
>> Effectively we can get our compute wavefront launched ASAP, but we might not get the time we need on the SQ.
>>
>> The current programming would have to be changed to allow priority propagation from the HQD into the SQ.
>>
>> Generic approach for all HW IPs:
>> --------------------------------
>>
>> For consistency purposes, the high priority context can be enabled for all HW IPs with support of the SW scheduler. This will function similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of anything not committed to the HW queue.
>>
>> The benefits of requesting a high priority context for a non-compute queue will be lesser (e.g. up to 10s of wait time if a GFX command is stuck in front of you), but having the API in place will allow us to easily improve the implementation in the future as new features become available in new hardware.
>>
>> Future steps:
>> -------------
>>
>> Once we have an approach settled, I can take care of the implementation.
>>
>> Also, once the interface is mostly decided, we can start thinking about exposing the high priority queue through radv.
>>
>> Request for feedback:
>> ---------------------
>>
>> We aren't married to any of the approaches outlined above. Our goal is to obtain a mechanism that will allow us to complete the reprojection job within a predictable amount of time. So if anyone has any suggestions for improvements or alternative strategies we are more than happy to hear them.
>>
>> If any of the technical information above is also incorrect, feel free to point out my misunderstandings.
>>
>> Looking forward to hearing from you.
>>
>> Regards,
>> Andres
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx