Re: [Linaro-mm-sig] [RFCv2 PATCH 2/9 - 4/4] v4l: vb2-dma-contig: update and code refactoring

Jerome Glisse <j.glisse@xxxxxxxxx> · Tue, 27 Mar 2012 12:45:23 -0400

On Tue, Mar 27, 2012 at 11:01 AM, Laurent Pinchart
<laurent.pinchart@xxxxxxxxxxxxxxxx> wrote:
> Hi Tomasz,
>
> On Thursday 22 March 2012 16:58:27 Tomasz Stanislawski wrote:
>> On 03/22/2012 03:42 PM, Laurent Pinchart wrote:
>> > On Thursday 22 March 2012 14:36:33 Tomasz Stanislawski wrote:
>> >> On 03/22/2012 11:50 AM, Laurent Pinchart wrote:
>> >>> On Thursday 22 March 2012 11:02:23 Laurent Pinchart wrote:
>
> [snip]
>
>> >>>>  static void *vb2_dc_alloc(void *alloc_ctx, unsigned long size)
>> >>>>  {
>> >>>>
>> >>>>          struct device *dev = alloc_ctx;
>> >>>>          struct vb2_dc_buf *buf;
>> >>>>
>> >>>> +        int ret;
>> >>>> +        int n_pages;
>> >>>>
>> >>>>          buf = kzalloc(sizeof *buf, GFP_KERNEL);
>> >>>>          if (!buf)
>> >>>>
>> >>>>                  return ERR_PTR(-ENOMEM);
>> >>>>
>> >>>> -        buf->vaddr = dma_alloc_coherent(dev, size, &buf->dma_addr,
>> >>>> -                GFP_KERNEL);
>> >>>> +        buf->dev = dev;
>> >>>> +        buf->size = size;
>> >>>> +        buf->vaddr = dma_alloc_coherent(buf->dev, buf->size, &buf->dma_addr,
>> >>>> +                GFP_KERNEL);
>> >>>> +
>> >>>> +        ret = -ENOMEM;
>> >>>>
>> >>>>          if (!buf->vaddr) {
>> >>>>
>> >>>> -                dev_err(dev, "dma_alloc_coherent of size %ld failed\n", size);
>> >>>> -                kfree(buf);
>> >>>> -                return ERR_PTR(-ENOMEM);
>> >>>> +                dev_err(dev, "dma_alloc_coherent of size %ld failed\n",
>> >>>> +                        size);
>> >>>> +                goto fail_buf;
>> >>>>
>> >>>>          }
>> >>>>
>> >>>> -        buf->dev = dev;
>> >>>> -        buf->size = size;
>> >>>> +        WARN_ON((unsigned long)buf->vaddr & ~PAGE_MASK);
>> >>>> +        WARN_ON(buf->dma_addr & ~PAGE_MASK);
>> >>>> +
>> >>>> +        n_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
>> >>>> +
>> >>>> +        pages = kmalloc(n_pages * sizeof pages[0], GFP_KERNEL);
>> >>>> +        if (!pages) {
>> >>>> +                printk(KERN_ERR "failed to alloc page table\n");
>> >>>> +                goto fail_dma;
>> >>>> +        }
>> >>>> +
>> >>>> +        ret = dma_get_pages(dev, buf->vaddr, buf->dma_addr, pages, n_pages);
>> >>>
>> >>> As the only purpose of this is to retrieve a list of pages that will be
>> >>> used to create a single-entry sgt, wouldn't it be possible to shortcut
>> >>> the code and get the physical address of the buffer directly ?
>> >>
>> >> Physical addresses should not be used since they are meaningless in the
>> >> context of different devices. It seems that only the list of pages is
>> >> more-or-less portable between different drivers.
>> >
>> > The pages are physically contiguous. The physical address of the first
>> > page is thus all you need.
>>
>> No. DMA-CONTIG buffers do not have to be physically contiguous. Please refer
>> below.
>>
>> > struct page and physical addresses can be used interchangeably in this
>> > case if I'm not mistaken. If you want to go with pages, you could use the
>> > first page only instead of the physical buffer address.
>>
>> Ok. There are bus addresses, physical addresses, DMA addresses and PFNs.
>> As I understand PFNs and 'struct page' can be interchanged, at least in one
>> direction. The PFNs are used to create a bus address, I mean addresses that
>> are recognized by a RAM chip. So a list of PFNs seems to be the most
>> portable way of describing the memory, isn't it?
>>
>> >> The physical address is already present in buf->dma_addr, but it is only
>> >> valid if the device has no MMU. Notice that vb2-dma-contig has no
>> >> knowledge of whether an MMU is present for a given device.
>> >
>> > That's why buf->dma_addr can't be considered as a physical address. It's
>> > only useful in the device context.
>>
>> ok
>>
>> >> The sg list is not going to be single-entry if the device is provided
>> >> with its own MMU.
>> >
>> > There's something I don't get then. vb2-dma-contig deals with physically
>> > contiguous buffers. The buffer is backed by physically contiguous pages,
>> > so the sg list should have a single entry.
>>
>> As I understand it, dma-contig deals with DMA-contiguous buffers, meaning
>> buffers that are contiguous from the device's point of view. Therefore those
>> buffers do NOT have to be physically contiguous if the device has its own
>> IOMMU.
>
> My bad. There was thus a misunderstanding to begin with.
>
> In the light of this new information, and (at least partially) sharing
> Daniel's opinion regarding dma_get_pages(), I think what we need here would be
> either
>
> -  a DMA API call that maps the memory to the importer device instead of
> dma_get_pages() + vb2_dc_pages_to_sgt(). The call would take a DMA memory
> "cookie" (see the "Minutes from V4L2 update call" mail thread) and a pointer
> to the importer device.
>
> - a DMA API call to retrieve a scatter list suitable to be passed to
> dma_map_sg(). This would be similar to dma_get_pages() +
> vb2_dc_pages_to_sgt().
>
> We also need to figure out whether the mapping call should be in the exporter
> or importer.
>
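
To illustrate the DMA-contiguous versus physically contiguous point above,
here is a minimal userspace sketch (not kernel code) that counts how many
scatterlist entries a buffer would need if each entry covered one run of
physically consecutive page frames. The helper name count_contiguous_runs()
and the plain PFN array are assumptions made for the example; they are not
part of the patch or of the DMA API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: given the page frame numbers
 * backing a buffer, count the runs of physically consecutive frames. Each
 * run would correspond to one scatterlist entry.
 */
static size_t count_contiguous_runs(const unsigned long *pfns, size_t n_pages)
{
	size_t runs = 0;
	size_t i;

	assert(pfns != NULL || n_pages == 0);

	for (i = 0; i < n_pages; i++) {
		/* A new run starts at the first page and after every gap. */
		if (i == 0 || pfns[i] != pfns[i - 1] + 1)
			runs++;
	}

	return runs;
}
```

With frames {100, 101, 102, 103} this yields a single entry, which is the
case without an IOMMU. With frames {100, 101, 250, 251, 7} it yields three
entries, even though an IOMMU could still present the buffer to the device
as one contiguous DMA range.
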
>> >>>> +        if (ret < 0) {
>> >>>> +                printk(KERN_ERR "failed to get buffer pages from DMA API\n");
>> >>>> +                goto fail_pages;
>> >>>> +        }
>> >>>> +        if (ret != n_pages) {
>> >>>> +                ret = -EFAULT;
>> >>>> +                printk(KERN_ERR "failed to get all pages from DMA API\n");
>> >>>> +                goto fail_pages;
>> >>>> +        }
>> >>>> +
>> >>>> +        buf->sgt_base = vb2_dc_pages_to_sgt(pages, n_pages, 0, 0);
>> >>>> +        if (IS_ERR(buf->sgt_base)) {
>> >>>> +                ret = PTR_ERR(buf->sgt_base);
>> >>>> +                printk(KERN_ERR "failed to prepare sg table\n");
>> >>>> +                goto fail_pages;
>> >>>> +        }
>> >>>
>> >>> buf->sgt_base isn't used in this patch. I would move the buf->sgt_base
>> >>> creation code to the patch that uses it then, or to its own patch just
>> >>> before the patch that uses it.
>> >>
>> >> Good point. The sgt_base is used by exporter only. Thanks for noticing
>> >> it.
>> >>
>> >>>> +
>> >>>> +        /* pages are no longer needed */
>> >>>> +        kfree(pages);
>> >>>>
>> >>>>          buf->handler.refcount = &buf->refcount;
>> >>>>          buf->handler.put = vb2_dc_put;
>> >
>> > [snip]
>> >
>> >>>>  /*********************************************/
>> >>>>  /*       callbacks for USERPTR buffers       */
>> >>>>  /*********************************************/
>> >>>>
>> >>>> +static inline int vma_is_io(struct vm_area_struct *vma)
>> >>>> +{
>> >>>> +        return !!(vma->vm_flags & (VM_IO | VM_PFNMAP));
>> >>>
>> >>> Isn't VM_PFNMAP enough ? Wouldn't it be possible (at least in theory) to
>> >>> get a discontinuous physical range with VM_IO ?
>> >>
>> >> Frankly, I found that in get_user_pages the flags are checked against
>> >> (VM_IO | VM_PFNMAP). Probably for the noMMU (not no IOMMU) case it is
>> >> possible to get a vma with VM_IO on and VM_PFNMAP off, isn't it?
>> >>
>> >> The problem is that this framework should work in both cases so this
>> >> check was added just in case :).
>> >
>> > OK. We can leave it here and deal with problems if they arise :-)
>> >
>> >>>> +}
>> >>>> +
>> >>>> +static int vb2_dc_get_pages(unsigned long start, struct page **pages,
>> >>>> +        int n_pages, struct vm_area_struct **copy_vma, int write)
>> >>>> +{
>> >>>> +        struct vm_area_struct *vma;
>> >>>> +        int n = 0; /* number of get pages */
>> >>>> +        int ret = -EFAULT;
>> >>>> +
>> >>>> +        /* entering critical section for mm access */
>> >>>> +        down_read(&current->mm->mmap_sem);
>> >>>
>> >>> This will generate AB-BA deadlock warnings if lockdep is enabled. This
>> >>> function is called with the queue lock held, and the mmap() handler
>> >>> which takes the queue lock is called with current->mm->mmap_sem held.
>> >>>
>> >>> This is a known issue with videobuf2, not specific to this patch. The
>> >>> warning is usually a false positive (which we still need to fix, as it
>> >>> worries users), but can become a real issue if an MMAP queue and a
>> >>> USERPTR queue are created by a driver with the same queue lock.
>> >>
>> >> Good point. Do you know any good solution to this problem?
>> >
>> > http://patchwork.linuxtv.org/patch/8455/
>> >
>> > It seems QBUF is safe, but PREPAREBUF isn't (both call __buf_prepare,
>> > which ends up calling the memops get_userptr operation).
>> >
>> > I'll post a patch to fix it for PREPAREBUF. If I'm not mistaken, you can
>> > drop the down_read/up_read here.
>>
>> ok. Thanks for the link.
>>
>> >>>> +        vma = find_vma(current->mm, start);
>> >>>> +        if (!vma) {
>> >>>> +                printk(KERN_ERR "no vma for address %lu\n", start);
>> >>>> +                goto cleanup;
>> >>>> +        }
>> >>>> +
>> >>>> +        if (vma_is_io(vma)) {
>> >>>> +                unsigned long pfn;
>> >>>> +
>> >>>> +                if (vma->vm_end - start < n_pages * PAGE_SIZE) {
>> >>>> +                        printk(KERN_ERR "vma is too small\n");
>> >>>> +                        goto cleanup;
>> >>>> +                }
>> >>>> +
>> >>>> +                for (n = 0; n < n_pages; ++n, start += PAGE_SIZE) {
>> >>>> +                        ret = follow_pfn(vma, start, &pfn);
>> >>>> +                        if (ret) {
>> >>>> +                                printk(KERN_ERR "no page for address %lu\n",
>> >>>> +                                        start);
>> >>>> +                                goto cleanup;
>> >>>> +                        }
>> >>>> +                        pages[n] = pfn_to_page(pfn);
>> >>>> +                        get_page(pages[n]);
>> >>>
>> >>> This worries me. When the VM_PFNMAP flag is set, the memory pages are
>> >>> not backed by a struct page. Creating a struct page pointer out of it
>> >>> can be an acceptable hack (for instance to store a page in a
>> >>> scatterlist with sg_set_page() and then retrieve its physical address
>> >>> with sg_phys()), but you should not expect the struct page to be valid
>> >>> for anything else. Calling get_page() on it will likely crash.
>> >>
>> >> You are completely right. This is the corner case where a list of pages
>> >> is not a portable way of describing the memory.
>> >> Maybe pfn_valid should be used to check validity of the page (pfn)
>> >> before getting it?
>> >
>> > I think you should just drop the get_page() call. There's no page, so
>> > there's no need to get a reference count to it.
>>
>> The problem is that get_user_pages does call get_page. Not calling get_page
>> will break the symmetry between PFNMAP and non-PFNMAP buffers. Maybe
>> checking page validity before get_page/put_page is enough?
>
> PFNMAP and non-PFNMAP buffers are inherently different, so I don't see a
> problem in handling them differently. We will likely run into an issue though,
> with hardware such as the OMAP TILER, where the memory isn't backed by normal
> memory (and thus no struct page is present), but for which the target must be
> pinned somehow (in the case of the tiler, that would be a tiler mapping). I
> don't think we have an API to ask the kernel to pin a memory range regardless
> of how the memory is handled (system memory, reserved memory with PFNMAP,
> device mapping such as with the tiler, ...). This is an open issue. One
> possible solution is to deprecate USERPTR support for that kind of memory and
> use dma-buf instead.
>
>> > The VM_PFNMAP flag is mostly used with memory out of the kernel
>> > allocator's control if I'm not mistaken. The main use case I've seen is
>> > memory reserved at boot time and used as a frame buffer for instance. In
>> > that case the pages can't go away, as there is no page in the first place.
>> >
>> > This won't fix the DMA SG problem though (see below).
>> >
>> >>>> +                }
>> >>>> +        } else {
>> >>>> +                n = get_user_pages(current, current->mm, start & PAGE_MASK,
>> >>>> +                        n_pages, write, 1, pages, NULL);
>> >>>> +                if (n != n_pages) {
>> >>>> +                        printk(KERN_ERR "got only %d of %d user pages\n",
>> >>>> +                                n, n_pages);
>> >>>> +                        goto cleanup;
>> >>>> +                }
>> >>>> +        }
>> >>>> +
>> >>>> +        *copy_vma = vb2_get_vma(vma);
>> >>>> +        if (!*copy_vma) {
>> >>>> +                printk(KERN_ERR "failed to copy vma\n");
>> >>>> +                ret = -ENOMEM;
>> >>>> +                goto cleanup;
>> >>>> +        }
>> >>>
>> >>> Do we really need to make a copy of the VMA ? The only reason why we
>> >>> store a pointer to it is to check the flags in vb2_dc_put_userptr(). We
>> >>> could store the flags instead and avoid vb2_get_vma()/vb2_put_vma()
>> >>> calls altogether.
>> >>
>> >> I remember that there was a very good reason for copying this vma
>> >> structure.
>> >> You caught me on 'cargo-cult' programming.
>> >> I will do some reverse engineering and try to answer it soon.
>> >
>> > OK :-) I'm not copying the VMA in the OMAP3 ISP driver, which is why this
>> > caught my eyes. If you find the reason why copying it is needed, please
>> > add a comment to the code.
>>
>> The reason for copying the vma was that 'struct vma' has no reference
>> counter. Therefore it could be deleted after the mm lock is released, which
>> would end up freeing all pages belonging to the vma. To prevent it, a copy
>> of the vma is created.
>> Notice that inside vb2_get_vma the callback open is called for original
>> vma, preventing memory from being released. On vb2_put_vma the
>> complementary close is called.
>
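
For readers following the vma copy discussion, the open/close pairing
described above can be modelled in plain userspace C. This is only a sketch
under stated assumptions: struct fake_vma, struct fake_vm_ops, fake_get_vma()
and fake_put_vma() are hypothetical stand-ins invented for the example, and
they do not reproduce what vb2_get_vma()/vb2_put_vma() really do inside the
kernel.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mapping's open/close callbacks. */
struct fake_vm_ops {
	void (*open)(void *priv);
	void (*close)(void *priv);
};

/* Hypothetical, heavily reduced VMA model: it has no refcount of its own. */
struct fake_vma {
	unsigned long start;
	unsigned long end;
	const struct fake_vm_ops *ops;
	void *priv;
};

/* Copy the descriptor and tell the owner about the new user via open(). */
static struct fake_vma *fake_get_vma(const struct fake_vma *vma)
{
	struct fake_vma *copy;

	assert(vma != NULL);

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;

	*copy = *vma;
	if (copy->ops && copy->ops->open)
		copy->ops->open(copy->priv);

	return copy;
}

/* Complementary operation: call close() and drop the private copy. */
static void fake_put_vma(struct fake_vma *copy)
{
	if (!copy)
		return;
	if (copy->ops && copy->ops->close)
		copy->ops->close(copy->priv);
	free(copy);
}

/* Example owner callbacks that simply count active users. */
static void count_open(void *priv)  { ++*(int *)priv; }
static void count_close(void *priv) { --*(int *)priv; }
```

The copy survives even if the original descriptor goes away, and the owner
keeps its resources alive for as long as its user count stays above zero.
Whether that is needed on top of get_user_pages() is exactly the open
question in this part of the thread.
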
> Feel free to prove me wrong, but I think get_user_pages() is enough to prevent
> the pages from being freed, even if the VMA is deleted.
>
> However, there's one subtle issue that we will need to deal with when we
> implement cache management. It took me a lot of time to debug and fix it when
> I was working on the OMAP3 ISP driver, so I'll explain it in the hope that
> someone will find it later when dealing with the same problems :-)
>
> When a buffer is queued, the OMAP3 ISP driver cleans up the cache using the
> userspace mapping addresses (for USERPTR buffers). This might be a bad thing,
> but that's the way we currently implement that.
>
> A prior get_user_pages() call will make sure pages are not freed, but due to
> optimization in the lockless memory management algorithms the userspace
> mapping can disappear: the kernel might consider that a page can be freed,
> remove its userspace mapping, and then find out that the page is locked. It
> will then move on without restoring the userspace mapping, which will be
> restored when the next page fault occurs.
>
> When cleaning the cache using the userspace mapping addresses, any page for
> which the userspace mapping has been removed will trigger a page fault. The
> page fault handler (do_page_fault() in arch/arm/mm/fault.c) will try to
> read_lock mmap_sem. If it fails, it will check if the page fault occurred in
> userspace context, or from a known exception location. As neither condition is
> true, it will panic.
>
> The solution I found to this issue was to lock the VMA. This ensured that the
> userspace mapping would stay in place. See isp_video_buffer_lock_vma() in
> drivers/media/video/omap3isp/ispqueue.c. You could use a similar approach here
> if you want to ensure that userspace mappings are not removed, but once again
> I don't think that's needed (until we get to cache handling) as
> get_user_pages() will ensure that the pages are not freed.

I think the proper solution is to not use any user-allocated memory and always
use dma-buf. Some evil process thread might unlock the vma behind your back,
and you are back to the original issue.

Linux memory management is not designed to easily allow a device to DMA
to/from user-allocated memory, at least not for the use case where the DMA
operation might run over a long period of time.

I guess some VMA changes might help, but they might also be harmful, and I am
not familiar enough with the whole memory management to venture a guess on
what kind of implications there are.

Cheers,
Jerome Glisse