Add support for high priority scheduling in amdgpu

On 01.03.2017 at 18:24, Andres Rodriguez wrote:
>
>
> On 2017-03-01 12:13 PM, Andres Rodriguez wrote:
>>
>>
>> On 3/1/2017 6:42 AM, Christian König wrote:
>>> Patches #1-#14 are Acked-by: Christian König 
>>> <christian.koenig at amd.com>.
>>>
>>> Patch #15:
>>>
>>> Not sure if that is a good idea or not; I need to take a closer look
>>> after digging through the rest.
>>>
>>> In general the HW IP is just for the IOCTL API and not for internal 
>>> use inside the driver.
>> I'll drop this patch and use ring->funcs->type instead.
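>>
>> E.g. something along these lines (untested sketch; the helper name is
>> made up):
>>
>>     /* decide internally based on the ring type, not the IOCTL HW IP */
>>     static bool amdgpu_ring_has_high_prio(struct amdgpu_ring *ring)
>>     {
>>             return ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE &&
>>                    ring->funcs->set_priority;
>>     }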
>>>
>>> Patch #16:
>>>
>>> Really nice :) I don't have time to look into it in detail, but you
>>> have one misconception I'd like to point out:
>>>> The queue manager maintains a per-file descriptor map of user ring ids
>>>> to amdgpu_ring pointers. Once a map is created it is permanent 
>>>> (this is
>>>> required to maintain FIFO execution guarantees for a ring).
>>> Actually we don't have a FIFO execution guarantee per ring. We only 
>>> have that per context.
>>
>> Agreed. I'm using pretty imprecise terminology here which can be 
>> confusing. I wanted to be more precise than "context", because two 
>> amdgpu_cs_request submissions to the same context but with a 
>> different ring field can execute out of order.
>>
>> I think s/ring/context's ring/ should be enough to clarify here if 
>> you think so as well.

Yeah, just fix the description a bit and we are good to go.
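
For reference, the per-fd mapping described above is conceptually
something like the following (simplified sketch, names approximate;
amdgpu_ring_lru_get() is a made-up placeholder for the LRU policy):

    /* per file descriptor: user ring id -> amdgpu_ring, fixed once set */
    struct amdgpu_queue_mgr {
            struct mutex lock;
            struct amdgpu_ring *mapping[AMDGPU_MAX_RINGS];
    };

    static struct amdgpu_ring *
    amdgpu_queue_mgr_map(struct amdgpu_queue_mgr *mgr, u32 user_ring)
    {
            struct amdgpu_ring *ring;

            if (user_ring >= AMDGPU_MAX_RINGS)
                    return NULL;

            mutex_lock(&mgr->lock);
            ring = mgr->mapping[user_ring];
            if (!ring) {
                    /* first use: pick a ring via LRU, then pin the
                     * mapping so the context's ring stays FIFO */
                    ring = amdgpu_ring_lru_get();
                    mgr->mapping[user_ring] = ring;
            }
            mutex_unlock(&mgr->lock);
            return ring;
    }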

>>
>>>
>>> E.g. commands from different contexts can execute at the same time
>>> and out of order.
>>>
>>> Making this per file is ok for now, but you should keep in mind that 
>>> we might want to change that sooner or later.
>>>
>>> Patches #17 & #18: I need to take a closer look when I have more
>>> time, but the comments from others sounded valid to me as well.
>>>
>>> Patch #19: Raising and lowering the priority of a ring during 
>>> command submission doesn't sound like a good idea to me.
>
> I'm not really sure what would be a better time than at command 
> submission.
>
> If it was just SPI priorities we could have static partitioning of 
> rings, some high priority and some regular, etc. But that approach 
> reduces the number of rings available. It would also require a 
> callback at command submission time for CU reservation.

Ok, as Alex wrote as well, I'm not 100% sure that really works on all
hardware. But we could give it a try, since I don't see much of a better
alternative either.
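
Just so we mean the same thing, the CS-time part would be roughly the
following (hand-written sketch, not the actual patch; the priority field
names are whatever the series ends up using):

    /* at command submission: raise the ring if the context outranks it */
    if (ctx->priority > ring->priority && ring->funcs->set_priority)
            ring->funcs->set_priority(ring, ctx->priority);

Lowering it again afterwards is the part that needs the job lifetime
handling discussed below.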

>>>
>>> The way you currently have it implemented would also raise the 
>>> priority of already running jobs on the same ring. Keep in mind that 
>>> everything is pipelined here.
>> That is actually intentional. If there is work already on the ring
>> with lower priority, we don't want the high priority work to have to
>> wait for it to finish executing at regular priority. Therefore the
>> work that has already been committed to the ring inherits the higher
>> priority level.
>>
>> I agree this isn't ideal, which is why the LRU ring mapping policy is 
>> there to make sure this doesn't happen often.
>>>
>>> Additionally, you can't have a fence callback in the job structure,
>>> because the job structure is freed by that same fence. So it can
>>> happen that you access freed memory (if only for a very short period
>>> of time).
>> Any strong preference for either 1) refcounting the job structure, or 
>> 2) allocating a new piece of memory to store the callback parameters?

How about option #3: just tie it to the job lifetime.

See drivers/gpu/drm/amd/amdgpu/amdgpu_job.c. amdgpu_job_run() is called 
when the job is ready to run.

amdgpu_job_free_cb() is called from a work item after a scheduler job 
has finished executing. If that is too late, we could also add another 
callback to the scheduler for this.

amdgpu_job_free() is called when we submit the job directly without 
going through the scheduler. Don't touch it; that path is just for GPU 
reset handling.
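
In other words, something like this (rough sketch; the restore helper is
made up and the existing cleanup is elided):

    /* called from a work item once the scheduler job has executed */
    static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
    {
            struct amdgpu_job *job =
                    container_of(s_job, struct amdgpu_job, base);

            /* unlike in a fence callback, the job is still alive here,
             * so it is a safe place to drop the ring priority again */
            amdgpu_ring_restore_priority(job->ring, job->priority);

            /* ... existing cleanup ... */
    }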

Regards,
Christian.

>>
>>> Patches #20-#22 are Acked-by: Christian König 
>>> <christian.koenig at amd.com>.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 28.02.2017 at 23:14, Andres Rodriguez wrote:
>>>> This patch series introduces a mechanism that allows users with 
>>>> sufficient
>>>> privileges to categorize their work as "high priority". A userspace 
>>>> app can
>>>> create a high priority amdgpu context, where any work submitted to 
>>>> this context
>>>> will receive preferential treatment over any other work.
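>>>>
>>>> For illustration, creating such a context from userspace looks
>>>> roughly like this (sketch only; the priority field naming may
>>>> differ in the final patches):
>>>>
>>>>     union drm_amdgpu_ctx args = {0};
>>>>     args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
>>>>     args.in.priority = AMDGPU_CTX_PRIORITY_HIGH; /* new in series */
>>>>     drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));
>>>>     /* args.out.alloc.ctx_id now names the high priority context */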
>>>>
>>>> High priority contexts will be scheduled ahead of other contexts by 
>>>> the sw gpu
>>>> scheduler. This functionality is generic for all HW blocks.
>>>>
>>>> Optionally, a ring can implement a set_priority() function that allows
>>>> programming HW specific features to elevate a ring's priority.
>>>>
>>>> This patch series implements set_priority() for gfx8 compute rings. 
>>>> It takes
>>>> advantage of SPI scheduling and CU reservation to provide improved 
>>>> frame
>>>> latencies for high priority contexts.
>>>>
>>>> For compute + compute scenarios we get near perfect scheduling 
>>>> latency. E.g.
>>>> one high priority ComputeParticles + one low priority 
>>>> ComputeParticles:
>>>>      - High priority ComputeParticles: 2.0-2.6 ms/frame
>>>>      - Regular ComputeParticles: 35.2-68.5 ms/frame
>>>>
>>>> For compute + gfx scenarios the high priority compute application does
>>>> experience some latency variance. However, the variance has smaller
>>>> bounds and a smaller deviation than without high priority scheduling.
>>>>
>>>> Following is a graph of the frame time experienced by a high 
>>>> priority compute
>>>> app in 4 different scenarios to exemplify the compute + gfx latency 
>>>> variance:
>>>>      - ComputeParticles: this scenario involves running the compute 
>>>> particles
>>>>        sample on its own.
>>>>      - +SSAO: Previous scenario with the addition of running the 
>>>> ssao sample
>>>>        application that clogs the GFX ring with constant work.
>>>>      - +SPI Priority: Previous scenario with the addition of SPI 
>>>> priority
>>>>        programming for compute rings.
>>>>      - +CU Reserve: Previous scenario with the addition of dynamic CU
>>>>        reservation for compute rings.
>>>>
>>>> Graph link:
>>>> https://plot.ly/~lostgoat/9/
>>>>
>>>> As seen above, high priority contexts for compute allow us to 
>>>> schedule work
>>>> with enhanced confidence of completion latency under high GPU 
>>>> loads. This
>>>> property will be important for VR reprojection workloads.
>>>>
>>>> Note: The first part of this series is a resend of "Change 
>>>> queue/pipe split
>>>> between amdkfd and amdgpu" with the following changes:
>>>>      - Fixed kfdtest on Kaveri due to shift overflow. Refer to: 
>>>> "drm/amdkfdallow
>>>>        split HQD on per-queue granularity v3"
>>>>      - Used Felix's suggestions for a simplified HQD programming 
>>>> sequence
>>>>      - Added a workaround for a Tonga HW bug during HQD programming
>>>>
>>>> This series is also available at:
>>>> https://github.com/lostgoat/linux/tree/wip-high-priority


