Re: dma-buf non-coherent mmap

Thomas Hellstrom <thellstrom@xxxxxxxxxx> · Mon, 04 Nov 2013 08:53:50 +0100

On 11/01/2013 02:32 PM, Lucas Stach wrote:
Am Freitag, den 01.11.2013, 09:22 -0400 schrieb Rob Clark:
On Fri, Nov 1, 2013 at 6:17 AM, Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
On Fri, Nov 1, 2013 at 11:03 AM, Lucas Stach <l.stach@xxxxxxxxxxxxxx> wrote:
GStreamer needs _some_ way of accessing the buffer contents with the
CPU, this doesn't necessarily have to be the dumb mmap we have right
now.
DMA-BUF support in GSt is not really finished (I know we have a lot of
patches internally to make it actually work with anything, which are
trickling upstream right now), so this might be a good time to hammer
out how it should be done, so we can adapt GSt before people start to
rely on the dma-buf fallback mmap.

I would think the bracketed mmap idea that was thrown around by Rob some
time ago when the mmap topic first surfaced is simple enough that we
don't need additional userspace helpers and should be enough to make the
coherency issue disappear.
Yeah, if we add a BEGIN_MMAP/END_MMAP ioctl or so to the dma-buf mmap
stuff and simply demand that userspace calls that we'd have something
half-sane at least. Since we currently don't really have any real
users we could still do this abi change I think.

One open issue is whether the BEGIN_MMAP ioctl should also synchronize
with any outstanding gpu access. Imo it should, since that's pretty
much how all the current drm drivers work, but maybe we should reserve
space for a few flags so that this can be extended later on - Android
seems to be pretty insistent on using explicit syncpoints everywhere
instead of implicit synchronization. Or maybe we should have the flag
already, but reject implicit syncing until Maarten's dma_fence stuff
is in and enabled, dunno. Adding him to cc.
I don't think we need that right now. We may prepare for implicit sync
by having the flag ready in the API, but don't force the exporters to
implement it right now.

I suppose I wouldn't mind if we made BEGIN/END_MMAP as sort of clones
of the CPU_PREP/FINI ioctls I have in msm.. where the CPU_PREP takes a
flag/bitmask to indicate READ/WRITE/NOSYNC

That doesn't really help with partial buffer access, like you'd get if
you were using the buffer for vertex upload.  Although not really sure
if we'd be seeing dmabuf's for that sort of use-case.  We could make a
more elaborate kernel interface that maps better to all the different
cases in pipe_transfer.  Or we could just say that that is overkill.
I don't really have strong opinion about that (so someone who does,
send an RFC).  But I do agree with the "barriers only, not locks".

I just checked back with our GStreamer guy and he thinks that for the
use-cases he sees today it would make a lot of sense to have some way to
indicate that userspace is only going to access a partial range of the
buffer. I think this may be a crucial optimization on all kinds of
devices where you have to ensure coherency manually.

Personally I would like to have the same kind of partial access
indicators in DRMs cpu_prep/fini to optimize all the userspace
suballocations we have today.

I think we are all on the same page with the barriers only thing.

Regards,
Lucas

OK, so I've attached something that would work on top of the current 
mmap() interface, and that would basically be NOPs on
coherent architectures. Comment welcome. In order for this to be 
meaningful, we would need to make it mandatory if mmap is
implemented. One outstanding issue Is how this might play with cache 
line alignment. We might be able to require the implementation to handle 
this, making sure that synced regions align both to the device cache and 
the cpu cache, and that, for example, a
sync_for_cpu(DMA_TO_DEVICE) may actually result in device caches in the 
pad area being flushed...

Region pwrites / preads having basically an identical interface (except 
the linear flag) but also taking a user pointer and stride, would also 
work just fine. Linear preads / pwrites could, I guess, be implemented 
the intended way using write / read syscalls on the file-descriptor.

Still, I must say I'd prefer if we could just avoid mmap and the major 
drivers didn't implement it.
Second on my list is probably pwrites / preads, and third this proposed 
interface. My preference order is based on the way vmwgfx handles
this stuff from the Xorg driver today, doing unsynchronized region 
preads / pwrites from a shadow user-space buffer, keeping track on 
dirtied regions on both sides. (If it works for the Xorg driver, it 
should work for anything else :)).

Thanks,
/Thomas

enum dma_buf_data_direction {
	DMA_BUF_BIDIRECTIONAL,
	DMA_BUF_TO_DEVICE,
	DMA_BUF_FROM_DEVICE
};

/**
 * struct dma_buf_sync_arg - Argument to 
 *
 * @start_x:   Upper left X coordinate of area to be synced.
 * Unit is format-dependent
 * @start_y:   Upper left Y coordinate of area to be synced.
 * Unit is format-dependent.
 * @width:     Width of area to be synced. Grows to the right.
 * Unit is format-dependent.
 * @height:    Height of area to be synced. Grows downwards.
 * Unit is format-dependent.
 * @direction: Intended transfer direction of data.
 * @flags:     Flags to tune the synchronizing behaviour.
 */

struct dma_buf_sync_region_arg {
	__u32 buf_identifier;
	__u32 start_x;
	__u32 start_y;
	__u32 width;
	__u32 height;
	enum dma_buf_data_direction direction;
	__u32 flags;
	__u32 pad;
};

/**
 * Force sync any outstanding rendering of the exporter before returning.
 * This is strictly a performance hint. The user may omit this flag to
 * speed up execution of the call, if it is known that previous rendering
 * affecting (by read or write) the synchronized area has already finished. 
 */
#define DMA_BUF_SYNC_FLAG_SYNC  (1 << 0);

/**
 * Treat @start_x as byte offset into buffer and @width as byte
 * synchronization length. The values of @start_y and @height are ignored.
 * (Separate ioctl?)
 */
#define DMA_BUF_SYNC_FLAG_LINEAR (1 << 1);

/**
 * Allow the implementation to coalesce sync_for_device calls, until either
 * a) An explicit flush
 * b) A sync for cpu call with DMA_BUF_BIDIRECTIONAL or DMA_BUF_TO_DEVICE
 *
 * Note: An implementation may choose to ignore this flag.
 */
#define DMA_BUF_SYNC_FLAG_COALESCE (1 << 2); 

/**
 * IOCTLS-
 *
 * Kernel waits should, if possible, be performed interruptible, and the
 * ioctl may sett errno to EINTR if the ioctl needs to be restarted.
 * To be discussed: Any sync operation may not affect areas outside the
 * region indicated. (Good for vmwgfx, but plays ill with cache line alignment)
 */

/**
 * Sync for CPU.
 */
#define DMA_BUF_SYNC_REGION_FOR_CPU             \ 
_IOW(DMA_BUF_BASE, 0, struct dma_buf_sync_region_arg)

/**
 * Sync for device. This is the default state of a dma-buf. 
 */
#define DMA_BUF_SYNC_REGION_FOR_DEVICE				\
_IOW(DMA_BUF_BASE, 1, struct dma_buf_sync_region_arg)

/**
 * Flush any coalesced SYNC_REGION_FOR_DEVICE
 */
#define DMA_BUF_SYNC_REGION_FLUSH		\
_IOW(DMA_BUF_BASE, 2, __u32)
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel