Hi Nicolas On Mon, 11 May 2020 at 19:43, Nicolas Saenz Julienne <nsaenzjulienne@xxxxxxx> wrote: > > Hi Dave, > some questions/comments. > > On Mon, 2020-05-04 at 12:25 +0300, Laurent Pinchart wrote: > > From: Dave Stevenson <dave.stevenson@xxxxxxxxxxxxxxx> > > > > Add Broadcom VideoCore Shared Memory support. > > > > This new driver allows contiguous memory blocks to be imported > > into the VideoCore VPU memory map, and manages the lifetime of > > those objects, only releasing the source dmabuf once the VPU has > > confirmed it has finished with it. > > > > Driver upported from the RaspberryPi BSP at revision: > > 890691d1c996 ("staging: vc04_services: Fix vcsm overflow bug when counting > > transactions") > > forward ported to recent mainline kernel version. > > > > Signed-off-by: Naushir Patuck <naush@xxxxxxxxxxxxxxx> > > Signed-off-by: Dave Stevenson <dave.stevenson@xxxxxxxxxxxxxxx> > > Signed-off-by: Jacopo Mondi <jacopo@xxxxxxxxxx> > > --- > > [...] > > > + > > +/* Import a dma_buf to be shared with VC. */ > > +int > > +vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private, > > + struct dma_buf *dma_buf, > > + int fd, > > + struct dma_buf **imported_buf) > > +{ > > + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); > > + struct vc_sm_buffer *buffer = NULL; > > + struct vc_sm_import import = { }; > > + struct vc_sm_import_result result = { }; > > + struct dma_buf_attachment *attach = NULL; > > + struct sg_table *sgt = NULL; > > + dma_addr_t dma_addr; > > + int ret = 0; > > + int status; > > + > > + /* Setup our allocation parameters */ > > + pr_debug("%s: importing dma_buf %p/fd %d\n", __func__, dma_buf, fd); > > + > > + if (fd < 0) > > + get_dma_buf(dma_buf); > > + else > > + dma_buf = dma_buf_get(fd); > > + > > + if (!dma_buf) > > + return -EINVAL; > > + > > + attach = dma_buf_attach(dma_buf, &sm_state->pdev->dev); > > + if (IS_ERR(attach)) { > > + ret = PTR_ERR(attach); > > + goto error; > > + } > > + > > + sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL); > > + if (IS_ERR(sgt)) { > > + ret = PTR_ERR(sgt); > > + goto error; > > + } > > + > > + /* Verify that the address block is contiguous */ > > + if (sgt->nents != 1) { > > + ret = -ENOMEM; > > + goto error; > > + } > > + > > + /* Allocate local buffer to track this allocation. */ > > + buffer = kzalloc(sizeof(*buffer), GFP_KERNEL); > > + if (!buffer) { > > + ret = -ENOMEM; > > + goto error; > > + } > > + > > + import.type = VC_SM_ALLOC_NON_CACHED; > > + dma_addr = sg_dma_address(sgt->sgl); > > + import.addr = (u32)dma_addr; > > + if ((import.addr & 0xC0000000) != 0xC0000000) { > > + pr_err("%s: Expecting an uncached alias for dma_addr %pad\n", > > + __func__, &dma_addr); > > + import.addr |= 0xC0000000; > > + } > > Just so we don't forget about it, this shouldn't be needed once dma-ranges are > fixed. Indeed not, but we've had enough issues with dma-ranges going missing that screaming and shouting about it is a good thing. > > + import.size = sg_dma_len(sgt->sgl); > > + import.allocator = current->tgid; > > + import.kernel_id = get_kernel_id(buffer); > > + > > + memcpy(import.name, VC_SM_RESOURCE_NAME_DEFAULT, > > + sizeof(VC_SM_RESOURCE_NAME_DEFAULT)); > > + > > + pr_debug("[%s]: attempt to import \"%s\" data - type %u, addr %pad, size > > %u.\n", > > + __func__, import.name, import.type, &dma_addr, import.size); > > + > > + /* Allocate the videocore buffer. */ > > + status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result, > > + &sm_state->int_trans_id); > > + if (status == -EINTR) { > > + pr_debug("[%s]: requesting import memory action restart > > (trans_id: %u)\n", > > + __func__, sm_state->int_trans_id); > > + ret = -ERESTARTSYS; > > + private->restart_sys = -EINTR; > > + private->int_action = VC_SM_MSG_TYPE_IMPORT; > > + goto error; > > + } else if (status || !result.res_handle) { > > + pr_debug("[%s]: failed to import memory on videocore (status: > > %u, trans_id: %u)\n", > > + __func__, status, sm_state->int_trans_id); > > + ret = -ENOMEM; > > + goto error; > > + } > > + > > + mutex_init(&buffer->lock); > > + INIT_LIST_HEAD(&buffer->attachments); > > + memcpy(buffer->name, import.name, > > + min(sizeof(buffer->name), sizeof(import.name) - 1)); > > + > > + /* Keep track of the buffer we created. */ > > + buffer->private = private; > > + buffer->vc_handle = result.res_handle; > > + buffer->size = import.size; > > + buffer->vpu_state = VPU_MAPPED; > > + > > + buffer->imported = 1; > > + buffer->import.dma_buf = dma_buf; > > + > > + buffer->import.attach = attach; > > + buffer->import.sgt = sgt; > > + buffer->dma_addr = dma_addr; > > + buffer->in_use = 1; > > + buffer->kernel_id = import.kernel_id; > > + > > + /* > > + * We're done - we need to export a new dmabuf chaining through most > > + * functions, but enabling us to release our own internal references > > + * here. > > + */ > > + exp_info.ops = &dma_buf_import_ops; > > + exp_info.size = import.size; > > + exp_info.flags = O_RDWR; > > + exp_info.priv = buffer; > > + > > + buffer->dma_buf = dma_buf_export(&exp_info); > > Could you comment on the need for this second dma_buf? I've only reviewed code > related to mmal-vchiq imports, but it seems to me that it would be saner to do > the unmapping and unattaching explicitly as opposed to having this second > buffer refcount hit 0. Although, I can imagine this being needed for the > userspace interface. Indeed, as it is needed for the userspace interface it seemed to make more sense to have common handling rather than two code paths doing nearly the same thing but in different ways. Downstream we need a userspace import at least to allow MMAL to set up zero copy, so unless it raises any real objections then it would be useful to keep it. > When you talk about moving to dmabuf heaps, I've pictured a specific dmabuf > heap for vc4 that takes care of all the importing and unimporting (aside from > cma allocations). Am I right? If so, I'm pretty confident we can do away with > this. (Note I'm talking about the VideoCore4 VPU and other blocks, and not the vc4 DRM/KMS and V3D drivers) No, I'm looking at using the existing cma_heap driver to do the allocations, and then this driver will import them and handle the lifetime on behalf of the VPU. There's no need for VPU allocations to be split off into yet another heap. One of the things we are trying to get away from is having the gpu_mem reserved lump that Linux can't get access to at all, so allocating from the CMA heap and importing to the VPU avoids that. I'll give some history here, which also hopefully covers your query over switching mmal-vchiq to zero copy. Almost all the VC4 blocks need contiguous memory, so fragmentation was an issue. To resolve that we (back in Broadcom days) had the "relocatable heap" - allocations that needed to be locked before access and unlocked after. Unlocked blocks could be copied and moved around to free up larger contiguous blocks. These allocations use a handle instead of a pointer, and have internal refcounting etc. Basically providing some of the features of an MMU when you don't have one. The original VCSM driver allowed userspace to make a relocatable heap allocation, lock it, and the kernel to map the relevant pages into the ARM memory space. Now you have a shared buffer, and VCHIQ no longer has to copy the data back and forth. (Cache flushing was also handled). So MMAL in zero copy mode passes the VPU relocatable heap handle across in the VCHIQ message, not a pointer to the actual data. VCSM did the allocation on behalf of the MMAL client, and provides the mapping and VPU handle to the buffer. This still leaves the allocation being made from gpu_mem though. The rewrite (vc-sm-cma) was to make use of an import feature into the relocatable heap, termed internally as mem wrapping. Take a CMA allocation made by something, pass the DMA address and size across to the VPU, and it can insert it as a relocatable heap object that can be used in exactly the same way gpu_mem allocations. gpu_mem can now be shrunk in size :-) It was using a dma-buf as a convenient object to manage the allocation, and handle importing buffers allocated by other subsystems Note that we still have refcounting internally to the relocatable heap, so at the point the client says it has finished with it, the VPU may not have done. When the last relocatable heap reference is released, the kernel gets a callback (VC_SM_MSG_TYPE_RELEASED), and it is only at that point that it is safe to drop the reference to the imported dmabuf. V4L2 can do the relevant import and wrapping to a relocatable heap handle as part of the buffer passing. MMAL needs to do it manually from userspace as VCHIQ is the only in-kernel service that it uses, hence we need an import ioctl and free mechanism (if the handle is a dmabuf, then that's close). >From a platform level it would be nice to have the userspace ioctl for importing a dmabuf in mainline, however it isn't necessary for the V4L2 use cases that we're trying to upstream here. The driver without userspace API would look pretty much like the one in [1]. I'll try and update that to include the basic import userspace API to give a comparison. I don't mind which way this goes as to whether the userspace ioctl remains as downstream patches, but losing the dmabuf as the handle within vc-sm-cma will make that patch huge, and they're almost guaranteed to diverge. Ignore the caching ioctls - they're irrelevant. I hope that makes the situation a little clearer. Dave [1] https://github.com/6by9/linux/tree/staging_next_upstreaming_apr20/drivers/staging/vc04_services/vc-sm-cma