Re: "Fixes" for page flipping under PRIME on AMD & nouveau

To pick this up again after a week of manic testing :)

On 08/18/2016 04:23 AM, Michel Dänzer wrote:
On 18/08/16 01:12 AM, Mario Kleiner wrote:

Intel as display gpu + nouveau for render offload worked nicely
on intel-ddx with page flipping, proper timing, dmabuf fence sync
and all.

How about with AMD instead of nouveau in this case?


I don't have a real AMD Enduro laptop with either Intel + AMD or AMD + AMD at the moment, so i tested with my hacked-up setups, and there things look very good:

a) A standard PC with Intel Haswell + AMD Tonga Pro R9 380. Seems to work correctly: page-flipping is used, there are no visual artifacts or other problems, and my measurement equipment also shows perfect timing and no glitches. Performance is very good, even without Marek's recent SDMA + PRIME patch series. With his patches applied, though, it seems that some of the many criteria for using the SDMA path aren't satisfied, so it uses a fallback path on my machine.

One thing that confuses me so far is that visual results and measurement suggest it works nicely, properly serializing the rendering/detiling blit and the pageflip. But when i ftrace the Intel driver's reservation_object_wait_timeout_rcu() call, where it normally waits for the dmabuf fence to complete, i never see it blocking for more than a few dozen microseconds, and i couldn't find any other place where it blocks on detiling blit completion yet. IOW, it seems to work correctly in practice, but i don't know where it actually blocks. It could also be that the flip work func in Intel's driver just executes after the detiling blit has already completed.

b) A MacPro with dual Radeon HD-5770 and NVidia GeForce, and my pageflip hacks applied. I ported Marek's Mesa SDMA patch to r600, and with that i get very good performance for AMD Evergreen as render offload gpu for both the NVidia + AMD and the AMD + AMD combo. So this solved the performance problems on the older gpus. I assume Intel + old radeon-kms would behave equally well. So thanks Marek, that was perfect!

I guess that means we are really good now wrt. render offload whenever an Intel iGPU is used for display, regardless of whether nouveau or AMD is used as dGPU :)


Turns out that prime + page flipping currently doesn't work
on nouveau and amd. The first offload rendered images from
the imported dmabufs show up properly, but then the display
is stuck alternating between the first two or three rendered
frames.

The problem is that during the pageflip ioctl we pin the
dmabuf into VRAM in preparation for scanout, then unpin it
when we are done with it at next flip, but the buffer stays
in the VRAM memory domain.
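
One plausible reading of the symptom described above can be sketched as a toy model. It assumes (this is not stated in the thread) that moving the imported buffer to VRAM takes a one-time copy of its contents, so frames rendered later into the shared system RAM pages never reach the copy being scanned out. `SharedBuffer`, `run`, the two-buffer swap chain and the `move_back` switch are hypothetical names for illustration; this is plain Python, not kernel code.

```python
class SharedBuffer:
    """Toy stand-in for one dmabuf-backed buffer in the swap chain."""

    def __init__(self):
        self.ram_frame = None   # frame number last rendered into system RAM
        self.vram_copy = None   # snapshot held in VRAM by the display gpu
        self.domain = "GTT"     # memory domain as seen by the display gpu

    def render(self, frame_no):
        # The render offload gpu always writes into the shared RAM pages.
        self.ram_frame = frame_no

    def pin_for_scanout(self):
        # Assumption: migration to VRAM copies the contents only once,
        # at the moment the buffer leaves the GTT domain.
        if self.domain == "GTT":
            self.vram_copy = self.ram_frame
            self.domain = "VRAM"

    def unpin(self, move_back):
        # Without move_back the buffer simply stays in the VRAM domain.
        if move_back:
            self.domain = "GTT"
            self.vram_copy = None

    def scanout(self):
        return self.vram_copy if self.domain == "VRAM" else self.ram_frame


def run(n_frames, n_buffers=2, move_back=False):
    """Return the frame numbers that actually reach the display."""
    buffers = [SharedBuffer() for _ in range(n_buffers)]
    shown, previous = [], None
    for frame_no in range(n_frames):
        buf = buffers[frame_no % n_buffers]
        buf.render(frame_no)
        buf.pin_for_scanout()          # done during the pageflip ioctl
        shown.append(buf.scanout())
        if previous is not None:
            previous.unpin(move_back)  # old buffer released at this flip
        previous = buf
    return shown


stuck = run(6, move_back=False)
fixed = run(6, move_back=True)
print(stuck)  # [0, 1, 0, 1, 0, 1]
print(fixed)  # [0, 1, 2, 3, 4, 5]
```

With `move_back=False` the display alternates between the first frames, one per buffer in the swap chain, which matches the "stuck alternating" description. Setting `move_back=True` loosely corresponds to moving the buffer out of the VRAM domain at unpin time, and every new frame then reaches the display.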

Sounds like you found a bug here: BOs which are being shared between
different GPUs should always be pinned to GTT, moving them to VRAM (and
consequently the page flip) should fail.


Seems so, although i hoped i was fixing a bug, not exploiting a loophole. In practice i haven't observed trouble with the hack so far. I haven't looked deeply enough into how the DMA API below dmabuf operates, so this is just guesswork, but i suspect the reason that this doesn't blow up in an obvious way is that if the render offload gpu exports the dmabuf then the pages get pinned/locked into system RAM, so the pages can't move around or get paged out to swap as long as the dmabuf stays exported. When the dmabuf-importing AMD or nouveau display gpu then moves the bo from GTT to VRAM (or pseudo-moves it back with my hack), all that changes is some pin refcount for the RAM pages, but the refcount always stays non-zero and system RAM isn't freed or moved around during the session. I just wonder if this bug couldn't somehow be turned into a proper feature?
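
The refcount guess in the paragraph above can be written down as a small toy model. This is only a sketch of the guess itself, not of how the kernel's dmabuf or DMA code is implemented; `SystemRamPages` and `simulate_session` are hypothetical names, and the assumption that the exporter holds exactly one pin for the lifetime of the export is mine.

```python
class SystemRamPages:
    """Backing pages of a shared buffer, tracked by a simple pin count."""

    def __init__(self):
        self.pin_count = 0

    def pin(self):
        self.pin_count += 1

    def unpin(self):
        if self.pin_count == 0:
            raise RuntimeError("unbalanced unpin")
        self.pin_count -= 1

    def can_be_moved_or_swapped(self):
        # Pages are only free to move once nobody holds a pin on them.
        return self.pin_count == 0


def simulate_session(flips):
    """Return (lowest pin count seen, movable while exported, movable after)."""
    pages = SystemRamPages()
    pages.pin()                  # assumption: exporter pins once on export
    lowest = pages.pin_count
    for _ in range(flips):
        pages.pin()              # importer pins the buffer for scanout
        pages.unpin()            # importer unpins it at the next flip
        lowest = min(lowest, pages.pin_count)
    movable_while_exported = pages.can_be_moved_or_swapped()
    pages.unpin()                # export is torn down at end of session
    return lowest, movable_while_exported, pages.can_be_moved_or_swapped()


lowest, movable_during, movable_after = simulate_session(flips=1000)
print(lowest, movable_during, movable_after)  # 1 False True
```

In this model the importer's pin/unpin traffic never drives the count to zero, so the pages stay put for the whole session and only become movable once the export itself goes away.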

I'm tempted to keep my patches as a temporary stopgap measure in some kernel on GitHub, so my users could use them to get NVidia+NVidia or at least old AMD+AMD setups with radeon-kms + ati-ddx working well enough for their research work until some proper solution comes around. But if you think there is some major way this could blow up, corrupt data, or hang/crash during normal use, then better not. I don't know how many of my users have such systems, as my advice to them so far was to "stay the hell away from anything with hybrid graphics/Optimus/Enduro in its name if they value their work". Now i could change my purchase advice to "anything hybrid with an Intel iGPU is probably ok in terms of correctness/timing/performance for not too demanding performance needs".

The latest versions of DCE support scanning out from GTT, so that might
be a good solution at least for Carrizo and newer APUs, not sure it
makes sense for dGPUs though.

That would be good to have. But that means DCE-11 or later only? What is the constraint on older parts: do they need contiguous memory? I personally don't care about the dGPU case, i only use these dGPUs for testing because i don't have access to any real Enduro laptops with APUs.

-mario



AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
importer/display gpu, but very slow as prime exporter/render offload,
e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
that Mesa's blitImage function is the slow bit here. On r600 it seems
to draw a textured triangle strip to detile the gpu renderbuffer and
copy it into GTT. As drawing a textured fullscreen quad is normally
much faster, something special seems to be going on there wrt. DMA?

Maybe the rasterization as two triangles results in bad PCIe bandwidth
utilization. Using the asynchronous DMA engine for these transfers would
probably be ideal, but having the 3D engine rasterize a single rectangle
(either using the rectangle primitive or a large triangle with scissor)
might already help.
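
For scale, the 16 msecs figure quoted above can be turned into an effective transfer rate. The sketch below assumes 4 bytes per pixel (a 32 bpp framebuffer) and a 60 Hz refresh rate; neither is stated in the thread, and `effective_bandwidth` is a hypothetical helper name.

```python
def effective_bandwidth(width, height, bytes_per_pixel, seconds):
    """Bytes per second achieved when one full frame is copied in `seconds`."""
    return width * height * bytes_per_pixel / seconds


WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4      # assumption: 32 bpp framebuffer
COPY_TIME_S = 16e-3      # the 16 msecs reported for the detiling copy
REFRESH_HZ = 60          # assumption: 60 Hz display

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
bandwidth = effective_bandwidth(WIDTH, HEIGHT, BYTES_PER_PIXEL, COPY_TIME_S)
budget_fraction = COPY_TIME_S * REFRESH_HZ

print(f"{frame_bytes / 1e6:.1f} MB per frame")            # 8.3 MB per frame
print(f"{bandwidth / 1e6:.0f} MB/s effective")            # 518 MB/s effective
print(f"{budget_fraction:.0%} of one refresh interval")   # 96% of one refresh interval
```

Under these assumptions the copy alone uses almost the entire frame time at 60 Hz, and the effective rate of roughly 518 MB/s can be compared against the nominal bandwidth of the PCIe link in a given machine.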


_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel



