On Fri, Jun 23, 2017 at 1:55 PM, Zhou, David(ChunMing) <David1.Zhou at amd.com> wrote:
>
> ________________________________________
> From: Marek Olšák [maraeo at gmail.com]
> Sent: Friday, June 23, 2017 6:49 PM
> To: Christian König
> Cc: Zhou, David(ChunMing); Xie, AlexBin; amd-gfx at lists.freedesktop.org; Xie, AlexBin
> Subject: Re: [PATCH 1/3] drm/amdgpu: fix a typo
>
> On Fri, Jun 23, 2017 at 11:27 AM, Christian König
> <deathsimple at vodafone.de> wrote:
>> On 23.06.2017 at 11:08, zhoucm1 wrote:
>>>
>>> On 2017-06-23 17:01, zhoucm1 wrote:
>>>>
>>>> On 2017-06-23 16:25, Christian König wrote:
>>>>>
>>>>> On 23.06.2017 at 09:09, zhoucm1 wrote:
>>>>>>
>>>>>> On 2017-06-23 14:57, Christian König wrote:
>>>>>>>
>>>>>>> But giving the CS IOCTL an option for directly specifying the BOs
>>>>>>> instead of a BO list, like Marek suggested, would indeed save us some
>>>>>>> time here.
>>>>>>
>>>>>> Interesting. I have always been looking at how to improve our CS ioctl,
>>>>>> since UMD guys often complain that our command submission is slower
>>>>>> than on Windows. So how would we directly specify the BOs instead of a
>>>>>> BO list? A BO handle array from the UMD? Could you guys describe it
>>>>>> more clearly? Is it doable?
>>>>>
>>>>> Making the BO list part of the CS IOCTL wouldn't help at all for the
>>>>> closed-source UMDs. To be precise, we actually came up with the BO list
>>>>> approach because of their requirements.
>>>>>
>>>>> The biggest bunch of work during CS is reserving all the buffers,
>>>>> validating them and checking their VM status.
>>>>
>>>> Totally agree. Every time I read the code there, I want to optimize it.
>>>>
>>>>> It doesn't matter if the BOs come from the BO list or directly in the CS
>>>>> IOCTL.
>>>>>
>>>>> The key point is that CS overhead is pretty much irrelevant for the open
>>>>> source stack, since Mesa does command submission from a separate thread
>>>>> anyway.
>>>>
>>>> If it is irrelevant for the open stack, then how does the open source
>>>> stack handle "the biggest bunch of work during CS: reserving all the
>>>> buffers, validating them and checking their VM status"?
>>
>> Command submission on the open stack is outsourced to a separate user space
>> thread. E.g. when an application triggers a flush, the IBs created so far
>> are just put on a queue and another thread pushes them down to the kernel.
>>
>> I mean, reducing the overhead of the CS IOCTL is always nice, but you
>> usually won't see any fps increase as long as not all CPUs are completely
>> bound to some tasks.
>>
>>>> If the open stack has a better way, I think the closed stack can follow
>>>> it; I don't know the history.
>>>
>>> Do you not use a BO list at all in Mesa? In radv as well?
>>
>> I don't think so. Mesa just wants to send the list of used BOs down to the
>> kernel with every IOCTL.
>
> The CS ioctl actually costs us some performance, but not as much as on
> closed source drivers.
>
> MesaGL always executes all CS ioctls in a separate thread (in parallel
> with the UMD) except for the last IB that's submitted by SwapBuffers.
> SwapBuffers requires that all IBs have been submitted when SwapBuffers
> returns. For example, if you have 5 IBs per frame, 4 of them are
> executed on the thread and the overhead is hidden. The last one is
> executed on the thread too, but this time the Mesa driver has to wait
> for it.
> For things like glxgears with only 1 IB per frame, the thread
> doesn't hide anything and Mesa always has to wait for it after
> submission, just because of SwapBuffers.
>
> Having 10 or more IBs per frame is great, because 9 are done in
> parallel and the last one is synchronous. The final CPU cost is 10x
> lower, but it's not zero.
>
> [DZ] Thanks Marek, this is a very useful and helpful message for me to
> understand Mesa's CS behaviour. I will talk to the closed-source guys to
> see if it can be used for them.
> Another thing I want to confirm with you: do you know whether radv uses
> this CS approach as well?
>
> For us, it's certainly useful to optimize the CS ioctl because of apps
> that submit only 1 IB per frame, where multithreading has no effect or
> may even hurt performance.
>
> The most obvious inefficiency is the BO_LIST ioctl, which is completely
> unnecessary and only slows us down. What we need is exactly what
> radeon does.
>
> [DZ] I don't know how radeon handles the BO list, could you describe it
> as well?

Inputs for the following ioctls are:

AMDGPU: BO_LIST
- list of BOs

AMDGPU: CS
- list of IBs
- BO list handle

RADEON: CS
- one IB
- list of BOs

Ideal solution for a new amdgpu CS ioctl:
- list of IBs
- list of BOs

Marek
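
To make the comparison above concrete, here is a rough C sketch of what each
flow passes to the kernel. The structure and field names are simplified and
hypothetical (hence the fake_ prefix), not the real DRM uapi definitions;
they only illustrate where the BO list lives in each case.

/* Simplified, hypothetical structures -- not the real DRM uapi --
 * only meant to show where the BO list lives in each flow. */

#include <stdint.h>

/* Current amdgpu flow: two ioctls. First create a BO list ... */
struct fake_amdgpu_bo_list_args {
    uint32_t num_bos;
    uint64_t bo_handles_ptr;   /* userspace array of BO handles */
    uint32_t out_list_handle;  /* handle returned by the kernel */
};

/* ... then submit, referencing the list by handle. */
struct fake_amdgpu_cs_args {
    uint32_t bo_list_handle;   /* handle from the BO_LIST ioctl */
    uint32_t num_ibs;
    uint64_t ibs_ptr;          /* array of IB descriptors */
};

/* radeon flow: one ioctl, a single IB plus the BO list passed inline. */
struct fake_radeon_cs_args {
    uint64_t ib_ptr;           /* one IB */
    uint32_t num_bos;
    uint64_t bo_handles_ptr;   /* BO (reloc) list passed directly */
};

/* Ideal new amdgpu CS: one ioctl, multiple IBs, BOs passed inline. */
struct fake_amdgpu_cs2_args {
    uint32_t num_ibs;
    uint64_t ibs_ptr;
    uint32_t num_bos;
    uint64_t bo_handles_ptr;
};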
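
The multithreaded submission described earlier in the thread (IBs queued on
flush, a worker thread calling the CS ioctl, and the application thread only
blocking in SwapBuffers) boils down to a producer/consumer pattern. A minimal
sketch with hypothetical names follows; it is not the actual Mesa code.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct ib_job {
    struct ib_job *next;
    /* IB chunks, BO handles, out fence, ... (elided) */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_avail = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  all_done   = PTHREAD_COND_INITIALIZER;
static struct ib_job *head, *tail;  /* FIFO of pending submissions */
static bool worker_busy;

/* Hypothetical wrapper around the CS ioctl; body elided. */
static void submit_to_kernel(struct ib_job *job) { (void)job; }

static void *cs_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!head)
            pthread_cond_wait(&work_avail, &lock);
        struct ib_job *job = head;
        head = job->next;
        if (!head)
            tail = NULL;
        worker_busy = true;
        pthread_mutex_unlock(&lock);

        submit_to_kernel(job);      /* the expensive part runs off-thread */

        pthread_mutex_lock(&lock);
        worker_busy = false;
        pthread_cond_broadcast(&all_done);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Flush path: queue the IB and return immediately, no waiting. */
static void queue_ib(struct ib_job *job)
{
    pthread_mutex_lock(&lock);
    job->next = NULL;
    if (tail)
        tail->next = job;
    else
        head = job;
    tail = job;
    pthread_cond_signal(&work_avail);
    pthread_mutex_unlock(&lock);
}

/* SwapBuffers path: the only place the application thread must wait
 * until every queued IB has reached the kernel. */
static void wait_for_all_submissions(void)
{
    pthread_mutex_lock(&lock);
    while (head || worker_busy)
        pthread_cond_wait(&all_done, &lock);
    pthread_mutex_unlock(&lock);
}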