On Fri, Jun 23, 2017 at 1:55 PM, Zhou, David(ChunMing) <David1.Zhou at amd.com> wrote:
>
> ________________________________________
> From: Marek Olšák [maraeo at gmail.com]
> Sent: Friday, June 23, 2017 6:49 PM
> To: Christian König
> Cc: Zhou, David(ChunMing); Xie, AlexBin; amd-gfx at lists.freedesktop.org; Xie, AlexBin
> Subject: Re: [PATCH 1/3] drm/amdgpu: fix a typo
>
> On Fri, Jun 23, 2017 at 11:27 AM, Christian König
> <deathsimple at vodafone.de> wrote:
>> On 23.06.2017 at 11:08, zhoucm1 wrote:
>>>
>>> On 2017-06-23 17:01, zhoucm1 wrote:
>>>>
>>>> On 2017-06-23 16:25, Christian König wrote:
>>>>>
>>>>> On 23.06.2017 at 09:09, zhoucm1 wrote:
>>>>>>
>>>>>> On 2017-06-23 14:57, Christian König wrote:
>>>>>>>
>>>>>>> But giving the CS IOCTL an option for directly specifying the BOs
>>>>>>> instead of a BO list, like Marek suggested, would indeed save us some
>>>>>>> time here.
>>>>>>
>>>>>> Interesting. I have always been looking at how to improve our CS ioctl,
>>>>>> since UMD guys often complain that our command submission is slower
>>>>>> than on Windows. So how would we directly specify the BOs instead of a
>>>>>> BO list? A BO handle array from the UMD? Could you guys describe it
>>>>>> more clearly? Is it doable?
>>>>>
>>>>> Making the BO list part of the CS IOCTL wouldn't help at all for the
>>>>> closed-source UMDs. To be precise, we actually came up with the BO list
>>>>> approach because of their requirements.
>>>>>
>>>>> The biggest bunch of work during CS is reserving all the buffers,
>>>>> validating them and checking their VM status.
>>>>
>>>> Totally agree. Every time I read the code there, I want to optimize it.
>>>>
>>>>> It doesn't matter if the BOs come from the BO list or directly in the CS
>>>>> IOCTL.
>>>>>
>>>>> The key point is that CS overhead is pretty much irrelevant for the open
>>>>> source stack, since Mesa does command submission from a separate thread
>>>>> anyway.
>>>>
>>>> If it is irrelevant for the open stack, then how does the open source
>>>> stack handle "the biggest bunch of work during CS: reserving all the
>>>> buffers, validating them and checking their VM status"?
>>
>> Command submission on the open stack is outsourced to a separate user space
>> thread. E.g. when an application triggers a flush, the IBs created so far
>> are just put on a queue and another thread pushes them down to the kernel.
>>
>> I mean, reducing the overhead of the CS IOCTL is always nice, but you
>> usually won't see any fps increase as long as not all CPUs are completely
>> bound to some tasks.
>>
>>>> If the open stack has a better way, I think the closed stack can follow
>>>> it; I don't know the history.
>>>
>>> Do you not use a BO list at all in Mesa? In radv as well?
>>
>> I don't think so. Mesa just wants to send the list of used BOs down to the
>> kernel with every IOCTL.
>
> The CS ioctl actually costs us some performance, but not as much as on
> closed source drivers.
>
> MesaGL always executes all CS ioctls in a separate thread (in parallel
> with the UMD) except for the last IB that's submitted by SwapBuffers.
> SwapBuffers requires that all IBs have been submitted when SwapBuffers
> returns. For example, if you have 5 IBs per frame, 4 of them are
> executed on the thread and the overhead is hidden. The last one is
> executed on the thread too, but this time the Mesa driver has to wait
> for it.
> For things like glxgears with only 1 IB per frame, the thread
> doesn't hide anything and Mesa always has to wait for it after
> submission, just because of SwapBuffers.
>
> Having 10 or more IBs per frame is great, because 9 are done in
> parallel and the last one is synchronous. The final CPU cost is 10x
> lower, but it's not zero.
>
> [DZ] Thanks Marek, this is a very useful and helpful message for me to
> understand Mesa's CS behaviour. I will talk to the closed-source guys to
> see if it can be used for them.
> Another thing I want to confirm with you: do you know whether radv uses
> this CS approach as well?
>
> For us, it's certainly useful to optimize the CS ioctl because of apps
> that submit only 1 IB per frame, where multithreading has no effect or
> may even hurt performance.
>
> The most obvious inefficiency is the BO_LIST ioctl, which is completely
> unnecessary and only slows us down. What we need is exactly what
> radeon does.
>
> [DZ] I don't know how radeon handles the BO list, could you describe it
> as well?

Inputs for the following ioctls are:

AMDGPU: BO_LIST
- list of BOs

AMDGPU: CS
- list of IBs
- BO list handle

RADEON: CS
- one IB
- list of BOs

Ideal solution for a new amdgpu CS ioctl:
- list of IBs
- list of BOs

Marek
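
To make the comparison above concrete, here is a rough C sketch of what each
flow passes to the kernel. The structure and field names are simplified and
hypothetical (hence the fake_ prefix), not the real DRM uapi definitions;
they only illustrate where the BO list lives in each case.

/* Simplified, hypothetical structures -- not the real DRM uapi --
 * only meant to show where the BO list lives in each flow. */

#include <stdint.h>

/* Current amdgpu flow: two ioctls. First create a BO list ... */
struct fake_amdgpu_bo_list_args {
    uint32_t num_bos;
    uint64_t bo_handles_ptr;   /* userspace array of BO handles */
    uint32_t out_list_handle;  /* handle returned by the kernel */
};

/* ... then submit, referencing the list by handle. */
struct fake_amdgpu_cs_args {
    uint32_t bo_list_handle;   /* handle from the BO_LIST ioctl */
    uint32_t num_ibs;
    uint64_t ibs_ptr;          /* array of IB descriptors */
};

/* radeon flow: one ioctl, a single IB plus the BO list passed inline. */
struct fake_radeon_cs_args {
    uint64_t ib_ptr;           /* one IB */
    uint32_t num_bos;
    uint64_t bo_handles_ptr;   /* BO (reloc) list passed directly */
};

/* Ideal new amdgpu CS: one ioctl, multiple IBs, BOs passed inline. */
struct fake_amdgpu_cs2_args {
    uint32_t num_ibs;
    uint64_t ibs_ptr;
    uint32_t num_bos;
    uint64_t bo_handles_ptr;
};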
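
The multithreaded submission described earlier in the thread (IBs queued on
flush, a worker thread calling the CS ioctl, and the application thread only
blocking in SwapBuffers) boils down to a producer/consumer pattern. A minimal
sketch with hypothetical names follows; it is not the actual Mesa code.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct ib_job {
    struct ib_job *next;
    /* IB chunks, BO handles, out fence, ... (elided) */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_avail = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  all_done   = PTHREAD_COND_INITIALIZER;
static struct ib_job *head, *tail;  /* FIFO of pending submissions */
static bool worker_busy;

/* Hypothetical wrapper around the CS ioctl; body elided. */
static void submit_to_kernel(struct ib_job *job) { (void)job; }

static void *cs_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!head)
            pthread_cond_wait(&work_avail, &lock);
        struct ib_job *job = head;
        head = job->next;
        if (!head)
            tail = NULL;
        worker_busy = true;
        pthread_mutex_unlock(&lock);

        submit_to_kernel(job);      /* the expensive part runs off-thread */

        pthread_mutex_lock(&lock);
        worker_busy = false;
        pthread_cond_broadcast(&all_done);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Flush path: queue the IB and return immediately, no waiting. */
static void queue_ib(struct ib_job *job)
{
    pthread_mutex_lock(&lock);
    job->next = NULL;
    if (tail)
        tail->next = job;
    else
        head = job;
    tail = job;
    pthread_cond_signal(&work_avail);
    pthread_mutex_unlock(&lock);
}

/* SwapBuffers path: the only place the application thread must wait
 * until every queued IB has reached the kernel. */
static void wait_for_all_submissions(void)
{
    pthread_mutex_lock(&lock);
    while (head || worker_busy)
        pthread_cond_wait(&all_done, &lock);
    pthread_mutex_unlock(&lock);
}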