[RFC] video-buffer management optimizations

Guennadi Liakhovetski <g.liakhovetski@xxxxxx> · Mon, 21 Feb 2011 10:08:13 +0100 (CET)

Agenda:
=======

User-space applications need more flexibility in managing their video
(multimedia) buffers to achieve their goals. A popular example is a photo
camera with a preview. Currently the application first has to enter the
preview mode:

* set the preview format
* allocate buffers. If they are allocated in the kernel, i.e., with the MMAP
  method, their size is calculated from the frame size and the pixel format. If
  the READ or the USERPTR method is used, buffer allocation is handled in
  user-space.
* queue buffers
* start streaming
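
As a rough illustration of how the MMAP buffer size follows from the frame size
and the pixel format, here is a minimal sketch. The function name is made up
for this example, and it assumes a packed format without line padding; a real
driver may round the size up.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Minimum number of bytes for one frame of a packed pixel format.
 * Assumption: no per-line padding; real drivers may report a larger size.
 */
size_t frame_size_bytes(unsigned int width, unsigned int height,
                        unsigned int bits_per_pixel)
{
    assert(bits_per_pixel > 0);
    /* Round the total bit count up to whole bytes. */
    return ((size_t)width * height * bits_per_pixel + 7) / 8;
}
```

With 16 bits per pixel, a 640x480 preview frame needs 614400 bytes and a
2592x1944 still frame needs 10077696 bytes, i.e., roughly 16 times as much.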

Then, at some point during the running preview, the user presses the release
button, at which time the application has to

* stop streaming
* free the queue (REQBUFS with .count = 0)
* set the still image format
* allocate buffers
* queue buffers
* start streaming

This switch takes a long time. Part of the delay comes from having to wait for
the current frame to complete: a reasonable preview can be expected to run at
25 or 30 fps, i.e., to spend 40 or 33 ms per frame. The second component of the
delay is memory management. Depending on the type of memory used, the camera
resolution, and the data format, it can take hundreds of milliseconds to switch
to a new buffer queue.
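
The arithmetic behind these numbers can be sketched as follows. The function
names are made up for this example, and the simple sum is only a rough
worst-case model that assumes the two components do not overlap.

```c
#include <assert.h>

/* Time budget per frame, in milliseconds, for a given frame rate. */
double frame_period_ms(unsigned int fps)
{
    assert(fps > 0);
    return 1000.0 / fps;
}

/*
 * Rough worst-case switch delay: wait for the frame in flight to complete,
 * then redo the memory management (mem_mgmt_ms is a measured or assumed
 * figure, not something this sketch can derive).
 */
double switch_delay_ms(unsigned int fps, double mem_mgmt_ms)
{
    return frame_period_ms(fps) + mem_mgmt_ms;
}
```

For a 25 fps preview and an assumed 200 ms spent on memory management, this
gives 240 ms, most of which is memory management rather than frame completion.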

Another popular use-case is passing buffers between hardware entities with
or without intermediate data processing in user-space. Depending on whether
CPU processing is required, caches do or do not have to be invalidated. In this
case buffers either have to be allocated from a globally accessible pool
(buffer pool), or USERPTR or READ has to be used. Presently, no such global pool
solution exists in the kernel, and allocating buffers in user-space does not
normally support specifying caching requirements.

Further, in some cases performance can be critical for the application, in
others it might be essential to try to save memory.

Goals:
======

Minimize the delay by eliminating expensive memory allocation and cache
invalidation operations.

Possible memory configurations and requirements:
================================================

Currently there are three videobuffer management algorithms in the kernel:

* VMALLOC
* DMA_CONTIG
* DMA_SG

in videobuf versions 1 and 2. Both DMA_* helpers are, naturally, used when
buffers are filled by DMA. DMA_SG buffers are allocated with a page-size
granularity, which makes their re-use simple. DMA_CONTIG buffers are more
difficult to allocate (without an IOMMU) and to re-use for different frame sizes
due to likely memory fragmentation.
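
To illustrate why page-size granularity makes re-use simple, here is a minimal
sketch. The function names are made up for this example and do not correspond
to videobuf helpers: a scatter-gather buffer is just a set of pages, so it fits
any frame that needs no more pages than it already has.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole pages needed to hold `bytes` bytes. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/*
 * A scatter-gather buffer backed by `have_pages` pages can be re-used for a
 * frame of `bytes` bytes if it already has enough pages. No contiguity
 * requirement enters the check.
 */
int sg_can_reuse(size_t have_pages, size_t bytes, size_t page_size)
{
    return pages_needed(bytes, page_size) <= have_pages;
}
```

With 4096-byte pages, a 614400-byte preview frame takes 150 pages, so a buffer
sized for a 10077696-byte still frame (2461 pages) can be re-used for the
preview, but not the other way round.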

Further, there are three ways in which data can be retrieved by applications:

* using mmap(2) with buffers of type V4L2_MEMORY_MMAP
* using user-space buffers of type V4L2_MEMORY_USERPTR
* using read(2) with generic user-space memory

Finally, the obtained data can be used in one of the following three ways:

* using the CPU once to read out the data
* using the CPU multiple times to process the data in-place
* passing it on to other bus-masters for hardware processing / output

If CPU access is required, the cache has to be invalidated before passing
buffers back to the user with DQBUF or upon queuing them with QBUF. If the
buffers have to be accessed only once by the CPU, it might be better to allocate
them in uncached memory and let applications try to compensate for the uncached
access by implementing some low-level optimizations. OTOH, even in this case
applications can benefit from pre-fetching the data a cache-line at a time.
Generally, using non-coherent DMA-able contiguous memory and implementing
suitable cache management seems preferable to using coherent memory.
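
The rule in this paragraph can be written down as a small decision function.
This is only a sketch; the enum and the function name are made up for this
example and are not existing kernel API.

```c
#include <assert.h>

/* The ways in which the obtained data can be used. */
enum buf_use {
    BUF_USE_CPU_ONCE,       /* CPU reads the data out once */
    BUF_USE_CPU_IN_PLACE,   /* CPU processes the data in-place, repeatedly */
    BUF_USE_HW_ONLY         /* data goes straight to another bus-master */
};

/*
 * Does the cache have to be invalidated around QBUF / DQBUF?
 * Only if the CPU touches the data at all and the memory is cached.
 */
int needs_cache_invalidate(enum buf_use use, int mem_is_cached)
{
    assert((int)use >= 0 && (int)use <= BUF_USE_HW_ONLY);
    if (use == BUF_USE_HW_ONLY)
        return 0;
    return mem_is_cached != 0;
}
```

The read-once case is where the trade-off sits: with uncached memory the
invalidation disappears, but every CPU access becomes slower.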

Use-cases:
==========

The most interesting case for us is when the buffers are allocated in the V4L
driver context, i.e., when the V4L2_MEMORY_MMAP method is used.

Example 1:
----------

MMAP, DMA-Contiguous, CPU access required, optimize for performance.

The best result is achieved by pre-allocating and pre-queuing buffers of two
sizes. Pre-queuing them eliminates the expensive cache invalidation when
switching to the still image mode. Switching is then performed in the following
steps:

* stop streaming
* switch to the big-buffer queue
* set the still image format
* start streaming

Required extensions:

1. allocate multiple queues per device
2. switch between queues
3. specify buffer size explicitly, before the image format is known to the
   driver
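
A minimal user-space model of these three extensions could look as follows.
All structure and function names are hypothetical; nothing here is existing
V4L2 or videobuf API. The point of the sketch is that, with both queues
allocated in advance, switching reduces to changing an index.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES 2

/* One pre-allocated buffer queue. */
struct buf_queue {
    size_t buf_size;        /* set explicitly, before any format is known */
    unsigned int count;
};

/* Hypothetical per-device state with several queues. */
struct video_dev {
    struct buf_queue queues[MAX_QUEUES];
    unsigned int n_queues;
    unsigned int active;    /* index of the queue used for streaming */
    int streaming;
};

/* Extensions 1 and 3: add a queue with an explicit buffer size. */
int dev_add_queue(struct video_dev *dev, unsigned int count, size_t buf_size)
{
    if (dev->n_queues >= MAX_QUEUES || count == 0 || buf_size == 0)
        return -1;
    dev->queues[dev->n_queues].count = count;
    dev->queues[dev->n_queues].buf_size = buf_size;
    return (int)dev->n_queues++;
}

/* Extension 2: switch queues; only allowed while streaming is stopped. */
int dev_switch_queue(struct video_dev *dev, unsigned int idx)
{
    if (dev->streaming || idx >= dev->n_queues)
        return -1;
    dev->active = idx;
    return 0;
}
```

An application would add a preview queue and a still queue once at start-up and
call the switch function between stopping and restarting the stream.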

Example 2:
----------

MMAP, DMA-Contiguous, CPU access required, optimize for memory.

In this case we can sacrifice some performance by taking the cache-invalidation
hit, but save memory by re-using the buffers. The easiest case is to allocate
one large chunk of memory and use it either for the smaller or for the bigger
buffers. CMA can be useful for this. The advantage is that new buffers do not
have to be allocated; the same chunk of memory is merely split into differently
sized buffers. A single buffer queue is used.

Required extensions:

REQBUFS shall be extended to take a "persistent" flag. If it is set, the
following REQBUFS(0) will not release the buffers, but keep the memory for a
possible re-use with the new format. If re-use turns out to be impossible, the
old buffers shall be freed and new ones allocated. A driver that supports it
can also decide to use CMA.
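
The intended "persistent" semantics can be modelled in user-space as in the
sketch below. The structure and function names are hypothetical, and malloc()
merely stands in for whatever allocator (e.g., CMA) a driver would use.

```c
#include <assert.h>
#include <stdlib.h>

/* One chunk of memory, split into `count` buffers of `buf_size` bytes. */
struct buf_pool {
    void *mem;
    size_t capacity;
    size_t buf_size;
    unsigned int count;
    int persistent;         /* the proposed flag */
};

/*
 * REQBUFS-like request. Returns 1 if the existing chunk was re-used,
 * 0 if nothing was re-used (fresh allocation or release), -1 on failure.
 */
int pool_request(struct buf_pool *p, unsigned int count, size_t buf_size)
{
    size_t need = (size_t)count * buf_size;

    p->count = 0;
    p->buf_size = 0;
    if (count == 0) {
        /* REQBUFS(0): keep the memory only if the pool is persistent. */
        if (!p->persistent) {
            free(p->mem);
            p->mem = NULL;
            p->capacity = 0;
        }
        return 0;
    }
    if (p->mem && need <= p->capacity) {
        /* Re-use: only the split into buffers changes. */
        p->count = count;
        p->buf_size = buf_size;
        return 1;
    }
    /* Re-use impossible: free the old chunk and allocate a new one. */
    free(p->mem);
    p->mem = malloc(need);
    p->capacity = p->mem ? need : 0;
    if (!p->mem)
        return -1;
    p->count = count;
    p->buf_size = buf_size;
    return 0;
}

/* Start address of buffer `index` inside the chunk. */
void *pool_buffer(const struct buf_pool *p, unsigned int index)
{
    assert(index < p->count);
    return (char *)p->mem + (size_t)index * p->buf_size;
}
```

Allocating for the still format first and then requesting the smaller preview
buffers re-uses the same chunk; only the cache-invalidation cost remains.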

Example 3:
----------

MMAP, DMA-Contiguous, no CPU access required, optimize for performance.

Since no CPU access is required, and therefore no cache invalidation, persistent
buffers can be used without a major performance hit. So, either of the above two
schemes - with a single or with multiple queues - can be used. Additionally,
the driver has to be instructed to skip cache-invalidation.

Required extensions:

It has to be possible to tell the driver to skip cache invalidation.

Example 4:
----------

MMAP, DMA-Contiguous, no CPU access required, optimize for memory.

Skip cache invalidation. Re-use buffers if possible, otherwise re-allocate
them. Use a single queue.

Example 5:
----------

MMAP, DMA-SG.

Cache-invalidation and multiple queues can be used exactly as with contiguous
buffers, depending on the use-pattern. Memory re-use becomes easier even
without CMA, due to smaller single buffers.

Example 6:
----------

MMAP, VMALLOC.

VMALLOC is usually used either by virtual drivers like vivi.c, or by USB camera
drivers. In the latter case the actual memory used to transfer the data from
the camera is managed by the USB layer, and the V4L driver uses the CPU to
possibly process the data and copy it into the buffers. Performance can be
improved by avoiding buffer re-allocation during the switch, either by using
two queues or by re-using the memory.

Example 7:
----------

USERPTR.

The "skip cache-invalidate" flag and multiple queues can be used to improve
performance. CMA and other kinds of memory re-use can only be applied in the
modules where the memory was originally allocated.

Example 8:
----------

READ.

Only the "skip cache-invalidate" flag makes sense.

Conclusion.
===========

The following functionality is required to support the optimizations described
above:

1. multiple video-buffer queues per device / filehandle: their allocation and
   switching between them.
2. an explicit "buffer-size" parameter, accepted either by the above queue-alloc
   method or by the VIDIOC_REQBUFS ioctl()
3. a "skip-cache-invalidate" flag for the above queue-alloc method, or for the
   VIDIOC_REQBUFS ioctl()
4. a "persistent" flag for VIDIOC_REQBUFS for buffer re-use
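
To make the four items more concrete, the sketch below shows what such an
extended request might carry. Everything in it is hypothetical: the structure,
the flag names and the validation helper are made up for this example and are
neither existing V4L2 API nor a proposed ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags, named after the wording of this RFC. */
#define XREQ_PERSISTENT            (1u << 0)   /* item 4 */
#define XREQ_SKIP_CACHE_INVALIDATE (1u << 1)   /* item 3 */
#define XREQ_KNOWN_FLAGS (XREQ_PERSISTENT | XREQ_SKIP_CACHE_INVALIDATE)

/* Hypothetical extended buffer request. */
struct xreq_buffers {
    uint32_t queue;         /* item 1: which queue of the device */
    uint32_t count;         /* number of buffers, as in REQBUFS */
    uint32_t buffer_size;   /* item 2: explicit size in bytes */
    uint32_t flags;         /* XREQ_* bits */
};

/* Reject requests for unknown queues or with unknown flag bits. */
int xreq_is_valid(const struct xreq_buffers *req, uint32_t n_queues)
{
    assert(req);
    if (req->queue >= n_queues)
        return 0;
    if (req->flags & ~XREQ_KNOWN_FLAGS)
        return 0;
    return 1;
}
```

Whether these fields end up in a new queue-alloc call or in an extended
VIDIOC_REQBUFS is exactly the open question of items 2 and 3.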

References.
===========

An earlier RFC [1] addresses some of the issues described here, but attacks them
from a slightly different angle. Nonetheless, the "global video buffer pools"
described there can be used, e.g., as a backing solution for some of the
proposed extensions.

The Contiguous Memory Allocator was last submitted as v8 on 15 December
2010 [2].

[1] http://lwn.net/Articles/353043/
[2] http://thread.gmane.org/gmane.linux.kernel.mm/56855

Please comment.

Thanks
Guennadi
---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/