On Sat, Jun 24, 2017 at 2:29 AM, Marek Olšák <maraeo at gmail.com> wrote:
> On Fri, Jun 23, 2017 at 1:55 PM, Zhou, David(ChunMing)
> <David1.Zhou at amd.com> wrote:
>>
>> ________________________________________
>> From: Marek Olšák [maraeo at gmail.com]
>> Sent: Friday, June 23, 2017 6:49 PM
>> To: Christian König
>> Cc: Zhou, David(ChunMing); Xie, AlexBin; amd-gfx at lists.freedesktop.org; Xie, AlexBin
>> Subject: Re: [PATCH 1/3] drm/amdgpu: fix a typo
>>
>> On Fri, Jun 23, 2017 at 11:27 AM, Christian König
>> <deathsimple at vodafone.de> wrote:
>>> On 23.06.2017 at 11:08, zhoucm1 wrote:
>>>>
>>>> On 2017-06-23 17:01, zhoucm1 wrote:
>>>>>
>>>>> On 2017-06-23 16:25, Christian König wrote:
>>>>>>
>>>>>> On 23.06.2017 at 09:09, zhoucm1 wrote:
>>>>>>>
>>>>>>> On 2017-06-23 14:57, Christian König wrote:
>>>>>>>>
>>>>>>>> But giving the CS IOCTL an option for directly specifying the BOs
>>>>>>>> instead of a BO list like Marek suggested would indeed save us
>>>>>>>> some time here.
>>>>>>>
>>>>>>> Interesting. I'm always following how to improve our CS ioctl,
>>>>>>> since the UMD guys often complain that our command submission is
>>>>>>> slower than on Windows. How would we directly specify the BOs
>>>>>>> instead of a BO list? A BO handle array from the UMD? Could you
>>>>>>> describe it more clearly? Is it doable?
>>>>>>
>>>>>> Making the BO list part of the CS IOCTL wouldn't help at all for
>>>>>> the closed source UMDs. To be precise, we actually came up with the
>>>>>> BO list approach because of their requirement.
>>>>>>
>>>>>> The biggest bunch of work during CS is reserving all the buffers,
>>>>>> validating them and checking their VM status.
>>>>>
>>>>> Totally agree. Every time I read the code there, I want to optimize
>>>>> it.
>>>>>
>>>>>> It doesn't matter if the BOs come from the BO list or directly in
>>>>>> the CS IOCTL.
>>>>>>
>>>>>> The key point is that CS overhead is pretty much irrelevant for the
>>>>>> open source stack, since Mesa does command submission from a
>>>>>> separate thread anyway.
>>>>>
>>>>> If it's irrelevant for the open stack, then how does the open source
>>>>> stack handle "the biggest bunch of work during CS: reserving all the
>>>>> buffers, validating them and checking their VM status"?
>>>
>>> Command submission on the open stack is outsourced to a separate user
>>> space thread. E.g. when an application triggers a flush, the IBs
>>> created so far are just put on a queue and another thread pushes them
>>> down to the kernel.
>>>
>>> I mean, reducing the overhead of the CS IOCTL is always nice, but you
>>> usually won't see any fps increase as long as not all CPUs are
>>> completely busy with some task.
>>>
>>>>> If the open stack has a better way, I think the closed stack can
>>>>> follow it; I don't know the history.
>>>>
>>>> Do you not use a BO list at all in Mesa? In radv as well?
>>>
>>> I don't think so. Mesa just wants to send the list of used BOs down to
>>> the kernel with every IOCTL.
>>
>> The CS ioctl actually costs us some performance, but not as much as on
>> closed source drivers.
>>
>> MesaGL always executes all CS ioctls in a separate thread (in parallel
>> with the UMD) except for the last IB that's submitted by SwapBuffers.
>> SwapBuffers requires that all IBs have been submitted when SwapBuffers
>> returns. For example, if you have 5 IBs per frame, 4 of them are
>> executed on the thread and the overhead is hidden. The last one is
>> executed on the thread too, but this time the Mesa driver has to wait
>> for it. For things like glxgears with only 1 IB per frame, the thread
>> doesn't hide anything and Mesa always has to wait for it after
>> submission, just because of SwapBuffers.
>>
>> Having 10 or more IBs per frame is great, because 9 are done in
>> parallel and the last one is synchronous. The final CPU cost is 10x
>> lower, but it's not zero.
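[Editor's note] The threaded submission described above can be sketched roughly as a producer/consumer queue: the app thread queues IBs on flush, a separate CS thread issues the ioctls, and SwapBuffers waits until the queue drains. This is a minimal, hypothetical sketch; all names are invented and the ioctl itself is stubbed out, so it is not Mesa's actual winsys code.

```c
#include <pthread.h>

struct cs_queue {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int head, tail;       /* queued IBs live in [head, tail) */
    int shutdown;
    int submitted;        /* number of "CS ioctls" issued so far */
};

static void *cs_thread(void *arg)
{
    struct cs_queue *q = arg;
    pthread_mutex_lock(&q->lock);
    while (!(q->shutdown && q->head == q->tail)) {
        if (q->head == q->tail) {
            pthread_cond_wait(&q->cond, &q->lock);
            continue;
        }
        q->head++;
        /* a real driver would issue the CS ioctl for this IB here */
        q->submitted++;
        pthread_cond_broadcast(&q->cond);
    }
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* glFlush path: queue the IB and return at once; the overhead is hidden */
static void cs_flush_async(struct cs_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->tail++;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* SwapBuffers path: queue the last IB, then wait for all of them */
static void cs_swap_buffers(struct cs_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->tail++;
    pthread_cond_broadcast(&q->cond);
    while (q->head != q->tail)
        pthread_cond_wait(&q->cond, &q->lock);
    pthread_mutex_unlock(&q->lock);
}
```

With 5 IBs per frame, the first 4 go through cs_flush_async() and cost the app thread almost nothing; only cs_swap_buffers() blocks, matching the 5-IB example above.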
>> [DZ] Thanks Marek, this is a very useful and helpful message for me to
>> understand Mesa's CS behavior. I will talk to the closed source guys to
>> see if it can be used for them.
>> Another thing I want to confirm with you: do you know if radv uses this
>> CS approach as well?
>>
>> For us, it's certainly useful to optimize the CS ioctl because of apps
>> that submit only 1 IB per frame, where multithreading has no effect or
>> may even hurt performance.
>>
>> The most obvious inefficiency is the BO_LIST ioctl, which is completely
>> unnecessary and only slows us down. What we need is exactly what radeon
>> does.
>>
>> [DZ] I don't know how radeon handles the BO list, could you describe it
>> as well?
>
> Inputs for the following ioctls are:
>
> AMDGPU: BO_LIST:
> - list of BOs
>
> AMDGPU: CS:
> - list of IBs
> - BO list handle
>
> RADEON: CS:
> - one IB
> - list of BOs
>
> Ideal solution for a new amdgpu CS ioctl:
> - list of IBs
> - list of BOs

I'd like to say that the current CS ioctl design is only half of the
problem with slow command submission. The second half is the libdrm
overhead itself. Having wrapper functions around ioctls that have to
unwrap input objects into alloca'd memory just for the ioctl to be
called is simply wasted CPU time.

There are cases where libdrm is useful. Command submission is not one of
them. Any driver developer or vendor putting CS wrappers into libdrm is
putting themselves in a losing position. It's a perpetuated community
myth that libdrm should have wrappers for everything.

Our winsys in Mesa is designed such that it can call the CS ioctl right
when it's requested. The whole CS ioctl input structure is always ready
at any point in time, because it's updated incrementally while draw
calls are made. The radeon winsys works that way. The amdgpu winsys
works that way too, but there is another translation of inputs in libdrm
that defeats it. Intel also publicly admitted that putting CS wrappers
into libdrm was stupid.
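[Editor's note] The "ideal solution" in the list above could look something like the following input layout: the IBs and the BO handles travel in one call, so the separate BO_LIST ioctl disappears. These struct and field names are invented for illustration and are not the real amdgpu UAPI; the per-submission ioctl counts assume one BO list is created per CS today.

```c
#include <stdint.h>

struct new_cs_ib {
    uint64_t va_start;       /* GPU virtual address of the IB */
    uint32_t size_dw;        /* IB length in dwords */
    uint32_t flags;
};

struct new_cs_in {
    uint32_t ctx_id;         /* submission context */
    uint32_t num_ibs;
    uint64_t ibs_ptr;        /* user pointer to struct new_cs_ib[] */
    uint32_t num_bo_handles;
    uint32_t pad;
    uint64_t bo_handles_ptr; /* user pointer to uint32_t[]; this field
                                replaces the BO list handle entirely */
};

/* Kernel round trips per submission under each design (assuming one
 * BO list per CS in the current scheme). */
static int ioctls_per_submit_today(void) { return 2; }  /* BO_LIST + CS */
static int ioctls_per_submit_ideal(void) { return 1; }  /* combined CS  */
```

This is essentially the radeon CS shape (BOs inline) generalized to a list of IBs, which is exactly the combination the list above asks for.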
The solution for the best performance is to call the CS ioctl directly
from the UMDs, i.e. Mesa and Vulkan should call
drmCommandWriteRead(fd, DRM_AMDGPU_CS2, ...) directly. Until that's
done, command submission with the radeon kernel driver will remain
faster than with amdgpu.

Marek
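[Editor's note] The incremental winsys design described above (the CS input kept complete at all times, updated as draw calls are made, flushed with one direct call) can be sketched like this. cs_request and its limits are invented stand-ins, and the actual drmCommandWriteRead() call is only indicated in a comment, so nothing here touches a real GPU.

```c
#include <stdint.h>
#include <string.h>

#define MAX_IBS_PER_CS 8
#define MAX_BOS_PER_CS 64

struct cs_request {
    uint32_t num_ibs;
    uint64_t ib_addrs[MAX_IBS_PER_CS];
    uint32_t num_bos;
    uint32_t bo_handles[MAX_BOS_PER_CS];
};

/* called from every draw call that references a new buffer */
static int cs_add_buffer(struct cs_request *cs, uint32_t handle)
{
    if (cs->num_bos >= MAX_BOS_PER_CS)
        return -1;
    cs->bo_handles[cs->num_bos++] = handle;
    return 0;
}

static int cs_add_ib(struct cs_request *cs, uint64_t va)
{
    if (cs->num_ibs >= MAX_IBS_PER_CS)
        return -1;
    cs->ib_addrs[cs->num_ibs++] = va;
    return 0;
}

/* flush: the request is already complete, so a real driver would now
 * issue exactly one drmCommandWriteRead(fd, DRM_AMDGPU_CS, ...) here,
 * with no further translation of inputs.  Returns the IB count. */
static int cs_flush(struct cs_request *cs)
{
    int n = (int)cs->num_ibs;
    memset(cs, 0, sizeof(*cs));   /* ready for the next IB immediately */
    return n;
}
```

The point of the design is that cs_flush() does no marshalling at submit time; a libdrm wrapper that re-copies these arrays into its own alloca'd buffers is exactly the wasted work the mail complains about.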