On 16/12/16 02:24, Laurent Pinchart wrote:
Hello, This is a rebased version of the vb2 cache hints support patch series posted by Sakari more than a year ago. The patches have been modified as needed by the upstream changes and received the occasional small odd fix but are otherwise not modified. Please see the individual commit messages for more information. The videobuf2 memory managers use the DMA mapping API to handle cache synchronization on systems that require them transparently for drivers. As cache operations are expensive, system performances can be impacted. Cache synchronization can't be skipped altogether if we want to retain correct behaviour, but optimizations are possible in cases related to buffer sharing between multiple devices without CPU access to the memory. The first optimization covers cases where the memory never needs to be accessed by the CPU (neither in kernelspace nor in userspace). In those cases, as no CPU memory mappings exist, cache synchronization can be skipped. The situation could be detected in the kernel as we have enough information to determine whether CPU mappings for kernelspace or userspace exist (in the first case because drivers should request them explicitly, in the second case because the mmap() handler hasn't been invoked). This optimization is not implemented currently but should at least be prototyped as it could improve performances automatically in a large number of cases. The second class of optimizations cover cases where the memory sometimes needs to be accessed by the CPU. In those cases memory mapping must be created and cache handled, but cache synchronization could be skipped for buffer that are not touched by the CPU. By default the following cache synchronization operations need to be performed related to the buffer management ioctls. For simplicity means of QBUF below apply to buf VIDIOC_QBUF and VIDIOC_PREPARE_BUF. | QBUF | DQBUF ---------------------------------------- CAPTURE | Invalidate | Invalidate (*) OUTPUT | Clean | - (*) for systems using speculative pre-fetching only The following cases can be optimized. 1. CAPTURE, the CPU has not written to the buffer before QBUF Cache invalidation can be skipped at QBUF time, but becomes required at DQBUF time on all systems, regardless of whether they use speculative prefetching. 2. CAPTURE, the CPU will not read from the buffer after DQBUF Cache invalidation can be skipped at DQBUF time. 3. CAPTURE, combination of (1) and (2) Cache invalidation can be skipped at both QBUF and DQBUF time. 4. OUTPUT, the CPU has not written to the buffer before QBUF Cache clean can be skipped at QBUF time. The kernel can't detect thoses situations automatically and thus requires hints from userspace to decide whether cache synchronization can be skipped. It should be noted that those hints might not be honoured. In particular, if userspace hints that it hasn't touched the buffer with the CPU, drivers might need to perform memory accesses themselves (adding JPEG or MPEG headers to buffers is a common case where CPU access could be needed in the kernel), in which case the userspace hints will be ignored. Getting the hints wrong will result in data corruption. Userspace applications are allowed to shoot themselves in the foot, but driver are responsible for deciding whether data corruption can pose a risk to the system in general. For instance if the device could be made to crash, or behave in a way that would jeopardize system security, reliability or performances, when fed with invalid data, cache synchronization shall not be skipped solely due to possibly incorrect userspace hints. The V4L2 API defines two flags, V4L2-BUF-FLAG-NO-CACHE-INVALIDATE and V4L2_BUF_FLAG_NO_CACHE_SYNC, that can be used to provide cache-related hints to the kernel. However, no kernel has ever implemented support for those flags that are thus most likely unused. A single flag is enough to cover all the optimization cases described above, provided we keep track of the flag being set at QBUF time to force cache invalidation at DQBUF time for case (1) if the flag isn't set at DQBUF time. This patch series thus cleans up the userspace API and merges both flags into a single one. One potential issue with case (1) is that cache invalidation at DQBUF time for CAPTURE buffers isn't fully under the control of videobuf2. We can instruct the DMA mapping API to skip cache handling, but we can't force it to invalidate the cache in the sync_for_cpu operation for non speculative prefetching systems. Luckily, on ARM32 the current implementation always invalidates the cache in __dma_page_dev_to_cpu() for CAPTURE buffers so we are safe fot now. However, this is documented by a FIXME comment that might lead to someone fixing the implementation in the future. I believe we will have to the problem at the DMA mapping level, the userspace hint API shouldn't be affected. This RFC patch set achieves two main objectives: 1. Respect cache flags passed from the user space. As no driver nor videobuf2 has (ever?) implemented them, the two flags are replaced by a single one (V4L2_BUF_FLAG_NO_CACHE_SYNC) and the two old flags are deprecated. This is done since a single flag provides the driver with enough information on what to do. (Patches 01/11 to 05/11, see patch 04/11 for more information.) 2. Allow a driver using videobuf2 dma-contig memory type to choose whether it prefers coherent or non-coherent CPU access to buffer memory for MMAP and USERPTR buffers. This could be later extended to be specified by the user, and per buffer if needed. (Patches 06/11 and 11/11). Only dma-contig memory type is changed but the same could be done to dma-sg as well. Sakari offered in v1 to add it to the set if people are happy with the changes to dma-contig. Note should be taken that the series performs cache optimization for MMAP buffers only. DMABUF imported buffers have their cache synchronization handled by the exported through the dma_buf_map_attachment() and dma_buf_unmap_attachment() functions, and dma-buf lacks an API to perform memory synchronization without unmapping and remapping the buffers. This is not a blocker as far as this patch series is concerned, but importing buffers (usually exported by the CPU) is such an important use case that we can't considered the cache optimization problem anywhere close to being solved if we don't address this case. I plan to start addressing the problem in January or February 2017, feedback on this point will thus be appreciated.
This series looks good to me, so: Acked-by: Hans Verkuil <hans.verkuil@xxxxxxxxx> Regards, Hans
Sakari Ailus (10): vb2: Rename confusingly named internal buffer preparation functions vb2: Move buffer cache synchronisation to prepare from queue vb2: Move cache synchronisation from buffer done to dqbuf handler v4l: Unify cache management hint buffer flags vb2: Improve struct vb2_mem_ops documentation; alloc and put are for MMAP vb2: dma-contig: Remove redundant sgt_base field vb2: dma-contig: Don't warn on failure in obtaining scatterlist vb2: dma-contig: Move vb2_dc_get_base_sgt() up vb2: dma-contig: Let drivers decide DMA attrs of MMAP and USERPTR bufs vb2: dma-contig: Add WARN_ON_ONCE() to check for potential bugs Samu Onkalo (1): v4l2-core: Don't sync cache for a buffer if so requested Documentation/media/uapi/v4l/buffer.rst | 24 ++-- .../media/uapi/v4l/vidioc-prepare-buf.rst | 5 +- drivers/media/v4l2-core/videobuf2-core.c | 120 ++++++++++++------ drivers/media/v4l2-core/videobuf2-dma-contig.c | 135 ++++++++++++++------- drivers/media/v4l2-core/videobuf2-v4l2.c | 14 ++- include/media/videobuf2-core.h | 43 ++++--- include/trace/events/v4l2.h | 3 +- include/uapi/linux/videodev2.h | 7 +- 8 files changed, 228 insertions(+), 123 deletions(-)
_______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel