"Fixes" for page flipping under PRIME on AMD & nouveau

Mario Kleiner <mario.kleiner.de@xxxxxxxxx> · Wed, 17 Aug 2016 18:12:55 +0200

Hi,

I spent some time playing with DRI3/Present + PRIME to test how
well it works for Optimus/Enduro style setups wrt. page flipping
on the current kernel/mesa/xorg. I want page flipping because
neuroscience/medical applications need the reliable timing/timestamping
and tear-free presentation that we can currently only get via page
flipping, not via the copy-swap path.

Intel as display gpu + nouveau for render offload worked nicely
on intel-ddx with page flipping, proper timing, dmabuf fence sync
and all.

AMD uses copy swaps because radeon/amdgpu kms can't switch the
scanout mode from tiled to linear on the fly during flips. That's
a todo in itself. For the moment I used the ati-ddx with Option
"ColorTiling/ColorTiling2D" "off" to force my pair of old Radeon
HD-5770s into linear mode so page flipping can be used for
prime. The current modesetting-ddx will use page flipping in
any case, as it doesn't detect the tiling format mismatch.

nouveau uses page flips.

Turns out that prime + page flipping currently doesn't work
on nouveau and amd. The first offload rendered images from
the imported dmabufs show up properly, but then the display
is stuck alternating between the first two or three rendered
frames.

The problem is that during the pageflip ioctl we pin the
dmabuf into VRAM in preparation for scanout, then unpin it
when we are done with it at the next flip, but the buffer stays
in the VRAM memory domain. The next time we flip to the buffer,
the driver skips the DMA copy from GTT to VRAM during pinning,
because the buffer's content apparently already resides in VRAM.
Therefore it doesn't update the VRAM copy with the updated
dmabuf content in system RAM, so freshly rendered frames from the
prime exporter/render offload gpu never reach the display gpu and
one only sees stale images.
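
The failure mode described above can be modelled in a few lines of
user-space C. This is only a toy sketch, not driver code: every name
in it (toy_bo, toy_pin_vram, toy_unpin, toy_count_stale_flips) is
made up for illustration and is not a kernel, TTM, nouveau or radeon
symbol, and the "DMA copy" is just an integer assignment. It assumes
two buffers that are flipped alternately.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel or in TTM. */
enum mem_domain { DOMAIN_GTT, DOMAIN_VRAM };

struct toy_bo {
    enum mem_domain domain; /* where the driver believes the content lives */
    int sysram_frame;       /* frame the render gpu wrote into the dmabuf */
    int vram_frame;         /* frame the display gpu would scan out */
    int pin_count;
};

/* Pin for scanout: copy GTT -> VRAM only if the bo is not already in VRAM. */
static void toy_pin_vram(struct toy_bo *bo)
{
    assert(bo != NULL);
    if (bo->domain != DOMAIN_VRAM) {
        bo->vram_frame = bo->sysram_frame; /* stands in for the DMA copy */
        bo->domain = DOMAIN_VRAM;
    }
    bo->pin_count++;
}

/* Unpin: the bo is released but stays in the VRAM domain. */
static void toy_unpin(struct toy_bo *bo)
{
    assert(bo != NULL && bo->pin_count > 0);
    bo->pin_count--;
}

/* Flip nframes frames across two buffers; return how many flips were stale. */
static int toy_count_stale_flips(int nframes)
{
    struct toy_bo bufs[2] = {
        { DOMAIN_GTT, -1, -1, 0 },
        { DOMAIN_GTT, -1, -1, 0 },
    };
    struct toy_bo *prev = NULL;
    int stale = 0;

    for (int frame = 0; frame < nframes; frame++) {
        struct toy_bo *bo = &bufs[frame % 2];

        bo->sysram_frame = frame; /* render gpu produces a new frame */
        toy_pin_vram(bo);         /* pageflip pins the target buffer */
        if (bo->vram_frame != frame)
            stale++;              /* display would show an old image */
        if (prev != NULL)
            toy_unpin(prev);      /* old scanout buffer released at this flip */
        prev = bo;
    }
    return stale;
}
```

In this model only the first flip to each buffer performs a copy, so
every later flip scans out one of the first two frames, which matches
the "stuck alternating" symptom.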

The attached patches for nouveau and radeon kms seem to work
pretty ok: page flipping works, the display updates, presentation
is tear-free, dmabuf fence sync works, and onset timing/timestamping
is correct. They simply pin the buffer back into GTT, then unpin it,
to force a move of the buffer into the GTT domain, and thereby force
the following pin to do a new copy from GTT -> VRAM. The code tries
to avoid a useless copy from VRAM -> GTT during the pin op.
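
The idea behind the workaround can be sketched with a small
user-space toy model. This is not the attached patches and not real
driver code: all names (toy_bo, toy_pin, toy_unpin,
toy_pin_for_scanout, toy_count_stale_flips_with_workaround) are
hypothetical, the copy is an integer assignment, and the model simply
assumes that a move from VRAM to GTT can skip the copy back because
the dmabuf in system RAM already holds the newest frame.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel or in TTM. */
enum mem_domain { DOMAIN_GTT, DOMAIN_VRAM };

struct toy_bo {
    enum mem_domain domain; /* where the driver believes the content lives */
    int sysram_frame;       /* frame the render gpu wrote into the dmabuf */
    int vram_frame;         /* frame the display gpu would scan out */
    int pin_count;
};

/* Pin into a target domain, moving the bo there first if needed. */
static void toy_pin(struct toy_bo *bo, enum mem_domain target)
{
    assert(bo != NULL);
    if (bo->domain != target) {
        assert(bo->pin_count == 0); /* a pinned bo cannot be moved */
        if (target == DOMAIN_VRAM)
            bo->vram_frame = bo->sysram_frame; /* GTT -> VRAM copy */
        /* VRAM -> GTT: deliberately no copy back in this model */
        bo->domain = target;
    }
    bo->pin_count++;
}

static void toy_unpin(struct toy_bo *bo)
{
    assert(bo != NULL && bo->pin_count > 0);
    bo->pin_count--;
}

/* Workaround: bounce through GTT so the VRAM pin must copy again. */
static void toy_pin_for_scanout(struct toy_bo *bo)
{
    toy_pin(bo, DOMAIN_GTT);  /* forces the bo into the GTT domain */
    toy_unpin(bo);
    toy_pin(bo, DOMAIN_VRAM); /* now a fresh GTT -> VRAM copy happens */
}

/* Flip nframes frames across two buffers; return how many flips were stale. */
static int toy_count_stale_flips_with_workaround(int nframes)
{
    struct toy_bo bufs[2] = {
        { DOMAIN_GTT, -1, -1, 0 },
        { DOMAIN_GTT, -1, -1, 0 },
    };
    struct toy_bo *prev = NULL;
    int stale = 0;

    for (int frame = 0; frame < nframes; frame++) {
        struct toy_bo *bo = &bufs[frame % 2];

        bo->sysram_frame = frame; /* render gpu produces a new frame */
        toy_pin_for_scanout(bo);
        if (bo->vram_frame != frame)
            stale++;
        if (prev != NULL)
            toy_unpin(prev);      /* old scanout buffer released at this flip */
        prev = bo;
    }
    return stale;
}
```

The price in this model is one GTT -> VRAM copy per flip, which is
exactly what a prime importer needs anyway; the open question of how
to express "the VRAM copy is stale" more cleanly is not addressed by
the sketch.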

However, the approach feels very much like a hack, so I assume
this is not the proper way of doing it? I looked at what ttm has
to offer, but couldn't find anything elegant and obvious. Maybe
there is a way to evict a bo without actually copying data back
to RAM? Or to invalidate the VRAM copy as stale? Maybe I just
missed something, as I'm not very familiar with ttm.

Thoughts or suggestions?

Another insight from my hacks so far is that nouveau seems to
be fast as prime exporter/render offload, but rather slow as
display gpu/prime importer, as tested on a 2008 or 2009
MacBookPro dual-Nvidia laptop.

AMD, as tested with dual Radeon HD-5770, seems to be fast as prime
importer/display gpu, but very slow as prime exporter/render offload,
e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. It seems
that Mesa's blitImage function is the slow bit here. On r600 it seems
to draw a textured triangle strip to detile the gpu renderbuffer and
copy it into GTT. As drawing a textured fullscreen quad is normally
much faster, something special seems to be going on there wrt. DMA?
However, I don't have a realistic Enduro test setup with an AMD
iGPU + dGPU, only this cobbled-together pair of HD-5770s in a MacPro,
so this could be wrong.
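
To put the 16 msecs figure into perspective, here is a small
back-of-the-envelope helper. It assumes a 32 bpp (4 bytes per pixel)
framebuffer, which is my assumption and not something measured above;
the function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one frame in bytes. bytes_per_pixel = 4 (32 bpp) is an assumption. */
static uint64_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Effective throughput in bytes/second for a copy that took copy_usecs. */
static uint64_t copy_bytes_per_sec(uint64_t bytes, uint64_t copy_usecs)
{
    assert(copy_usecs > 0);
    return bytes * 1000000ULL / copy_usecs;
}
```

Under that assumption, frame_bytes(1920, 1080, 4) is 8294400 bytes,
and copy_bytes_per_sec(8294400, 16000) gives 518400000, i.e. roughly
518 MB/s of effective throughput. On a 60 Hz display (again an
assumption) the refresh period is about 16.7 msecs, so such a copy
would eat almost the whole frame time.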

thanks,
-mario

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel