On 2017-03-01 6:24 PM, Andres Rodriguez wrote:
>
>
> On 2017-03-01 12:13 PM, Andres Rodriguez wrote:
>>
>>
>> On 3/1/2017 6:42 AM, Christian König wrote:
>>> Patches #1-#14 are Acked-by: Christian König
>>> <christian.koenig at amd.com>.
>>>
>>> Patch #15:
>>>
>>> Not sure if that is a good idea or not, need to take a closer look
>>> after digging through the rest.
>>>
>>> In general the HW IP is just for the IOCTL API and not for internal
>>> use inside the driver.
>> I'll drop this patch and use ring->funcs->type instead.
>>>
>>> Patch #16:
>>>
>>> Really nice :) I don't have time to look into it in detail, but you
>>> have one misconception I'd like to point out:
>>>> The queue manager maintains a per-file descriptor map of user ring
>>>> ids to amdgpu_ring pointers. Once a map is created it is permanent
>>>> (this is required to maintain FIFO execution guarantees for a ring).
>>> Actually we don't have a FIFO execution guarantee per ring. We only
>>> have that per context.
>>
>> Agreed. I'm using pretty imprecise terminology here, which can be
>> confusing. I wanted to be more precise than "context", because two
>> amdgpu_cs_request submissions to the same context but with a
>> different ring field can execute out of order.
>>
>> I think s/ring/context's ring/ should be enough to clarify here if
>> you think so as well.

Yeah, just fix the description a bit and we are good to go.

>>
>>>
>>> E.g. commands from different contexts can execute at the same time
>>> and out of order.
>>>
>>> Making this per file is ok for now, but you should keep in mind that
>>> we might want to change that sooner or later.
>>>
>>> Patches #17 & #18: I need to take a closer look when I have more
>>> time, but the comments from others sounded valid to me as well.
>>>
>>> Patch #19: Raising and lowering the priority of a ring during
>>> command submission doesn't sound like a good idea to me.
>> I'm not really sure what would be a better time than at command
>> submission.
>>
>> If it was just SPI priorities we could have static partitioning of
>> rings, some high priority and some regular, etc. But that approach
>> reduces the number of rings
> Sorry, I finished typing something else and forgot this section was
> incomplete. Full reply:
>
> I'm not really sure what would be a better time than at command
> submission.
>
> If it was just SPI priorities we could have static partitioning of
> rings, some high priority and some regular, etc. But that approach
> reduces the number of rings available. It would also require a
> callback at command submission time for CU reservation.

OK, as Alex wrote as well, I'm not 100% sure that really works on all
hardware. But we could give it a try, since I don't see much of a
better alternative either.

>>>
>>> The way you currently have it implemented would also raise the
>>> priority of already running jobs on the same ring. Keep in mind that
>>> everything is pipelined here.
>> That is actually intentional. If there is work already on the ring
>> with lower priority, we don't want the high priority work to have to
>> wait for it to finish executing at regular priority. Therefore the
>> work that has already been committed to the ring inherits the higher
>> priority level.
>>
>> I agree this isn't ideal, which is why the LRU ring mapping policy is
>> there to make sure this doesn't happen often.
>>>
>>> In addition, you can't have a fence callback in the job structure,
>>> because the job structure is freed by the same fence as well. So it
>>> can happen that you access freed memory (but only for a very short
>>> period of time).
>> Any strong preference for either 1) refcounting the job structure, or
>> 2) allocating a new piece of memory to store the callback parameters?

How about option #3: just add that to the job lifetime. See
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c.

amdgpu_job_run() is called when the job is ready to run.

amdgpu_job_free_cb() is called from a work item after a scheduler job
has finished executing. If that is too late, we could also add another
callback to the scheduler for this.

amdgpu_job_free() is called when we directly submitted the job without
going through the scheduler. Don't touch it; that is just for GPU reset
handling.

Regards,
Christian.
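
To make the job-lifetime suggestion concrete, here is a rough C sketch
of the idea: undo the ring's priority boost from the scheduler's free
callback rather than from a fence callback stored inside the job. This
is only an illustration, not code from the series; the boosted_priority
field and the amdgpu_ring_restore_priority() helper are hypothetical
names, the callback signature is approximated from the scheduler of
that era, and the existing teardown is only summarized in a comment.

/*
 * Sketch only, under the assumptions named above.  The point is the
 * placement: the scheduler is finished with the job here, so undoing
 * the priority boost cannot race with the job's own teardown the way
 * a fence callback kept inside the job structure could.
 */
static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
{
	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);

	if (job->boosted_priority)			 /* hypothetical field */
		amdgpu_ring_restore_priority(job->ring); /* hypothetical helper */

	/* ... existing fence/sync teardown and kfree(job) ... */
}
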
>>
>>> Patches #20-#22 are Acked-by: Christian König
>>> <christian.koenig at amd.com>.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 2017-02-28 11:14 PM, Andres Rodriguez wrote:
>>>> This patch series introduces a mechanism that allows users with
>>>> sufficient privileges to categorize their work as "high priority".
>>>> A userspace app can create a high priority amdgpu context, where
>>>> any work submitted to this context will receive preferential
>>>> treatment over any other work.
>>>>
>>>> High priority contexts will be scheduled ahead of other contexts by
>>>> the sw gpu scheduler. This functionality is generic for all HW
>>>> blocks.
>>>>
>>>> Optionally, a ring can implement a set_priority() function that
>>>> allows programming HW specific features to elevate a ring's
>>>> priority.
>>>>
>>>> This patch series implements set_priority() for gfx8 compute rings.
>>>> It takes advantage of SPI scheduling and CU reservation to provide
>>>> improved frame latencies for high priority contexts.
>>>>
>>>> For compute + compute scenarios we get near perfect scheduling
>>>> latency. E.g. one high priority ComputeParticles + one low priority
>>>> ComputeParticles:
>>>>     - High priority ComputeParticles: 2.0-2.6 ms/frame
>>>>     - Regular ComputeParticles: 35.2-68.5 ms/frame
>>>>
>>>> For compute + gfx scenarios the high priority compute application
>>>> does experience some latency variance. However, the variance has
>>>> smaller bounds and a smaller deviation than without high priority
>>>> scheduling.
>>>>
>>>> Following is a graph of the frame time experienced by a high
>>>> priority compute app in 4 different scenarios to exemplify the
>>>> compute + gfx latency variance:
>>>>     - ComputeParticles: this scenario involves running the compute
>>>>       particles sample on its own.
>>>>     - +SSAO: Previous scenario with the addition of running the
>>>>       ssao sample application that clogs the GFX ring with constant
>>>>       work.
>>>>     - +SPI Priority: Previous scenario with the addition of SPI
>>>>       priority programming for compute rings.
>>>>     - +CU Reserve: Previous scenario with the addition of dynamic
>>>>       CU reservation for compute rings.
>>>>
>>>> Graph link:
>>>> https://plot.ly/~lostgoat/9/
>>>>
>>>> As seen above, high priority contexts for compute allow us to
>>>> schedule work with enhanced confidence of completion latency under
>>>> high GPU loads. This property will be important for VR reprojection
>>>> workloads.
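
On the userspace side described above, a high priority context is just
an ordinary amdgpu context created with an elevated priority by a
sufficiently privileged process. As a rough illustration, the C program
below uses the context-priority entry point that eventually landed in
libdrm, amdgpu_cs_ctx_create2(); at the time of this thread the exact
uAPI shape was still under review, so treat this as a sketch rather
than the interface proposed in the series.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#include <amdgpu.h>
#include <amdgpu_drm.h>

int main(void)
{
	amdgpu_device_handle dev;
	amdgpu_context_handle ctx;
	uint32_t major, minor;
	int fd, r;

	/* Any amdgpu render node will do; this path is just an example. */
	fd = open("/dev/dri/renderD128", O_RDWR);
	if (fd < 0)
		return 1;

	r = amdgpu_device_initialize(fd, &major, &minor, &dev);
	if (r)
		goto out_close;

	/*
	 * Requires sufficient privileges (e.g. CAP_SYS_NICE); otherwise
	 * the kernel is expected to refuse the elevated priority.
	 */
	r = amdgpu_cs_ctx_create2(dev, AMDGPU_CTX_PRIORITY_HIGH, &ctx);
	if (r) {
		fprintf(stderr, "high priority context denied: %d\n", r);
		goto out_dev;
	}

	/* ... build and submit command buffers against ctx as usual ... */

	amdgpu_cs_ctx_free(ctx);
out_dev:
	amdgpu_device_deinitialize(dev);
out_close:
	close(fd);
	return r ? 1 : 0;
}
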
>>>>
>>>> Note: The first part of this series is a resend of "Change
>>>> queue/pipe split between amdkfd and amdgpu" with the following
>>>> changes:
>>>>     - Fixed kfdtest on Kaveri due to shift overflow. Refer to:
>>>>       "drm/amdkfd: allow split HQD on per-queue granularity v3"
>>>>     - Used Felix's suggestions for a simplified HQD programming
>>>>       sequence
>>>>     - Added a workaround for a Tonga HW bug during HQD programming
>>>>
>>>> This series is also available at:
>>>> https://github.com/lostgoat/linux/tree/wip-high-priority
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>
>>
>