Re: [PATCH for v5.2] videobuf2-core.c: always reacquire USERPTR memory

Tomasz Figa <tfiga@xxxxxxxxxxxx> · Wed, 3 Jul 2019 18:08:10 +0900



On Thu, Jun 13, 2019 at 9:21 AM Nicolas Dufresne <nicolas@xxxxxxxxxxxx> wrote:
>
> Le mercredi 12 juin 2019 à 11:17 +0300, Laurent Pinchart a écrit :
> > Hi Nicolas,
> >
> > On Tue, Jun 11, 2019 at 08:09:13PM -0400, Nicolas Dufresne wrote:
> > > Le mardi 11 juin 2019 à 13:24 +0300, Laurent Pinchart a écrit :
> > > > On Fri, Jun 07, 2019 at 03:38:39PM -0400, Nicolas Dufresne wrote:
> > > > > Le vendredi 07 juin 2019 à 16:58 +0300, Laurent Pinchart a écrit :
> > > > > > On Fri, Jun 07, 2019 at 03:55:05PM +0200, Marek Szyprowski wrote:
> > > > > > > On 2019-06-07 15:40, Hans Verkuil wrote:
> > > > > > > > On 6/7/19 2:47 PM, Hans Verkuil wrote:
> > > > > > > > > On 6/7/19 2:23 PM, Hans Verkuil wrote:
> > > > > > > > > > On 6/7/19 2:14 PM, Marek Szyprowski wrote:
> > > > > > > > > > > On 2019-06-07 14:01, Hans Verkuil wrote:
> > > > > > > > > > > > On 6/7/19 1:16 PM, Laurent Pinchart wrote:
> > > > > > > > > > > > > Thank you for the patch.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jun 07, 2019 at 10:45:31AM +0200, Hans Verkuil wrote:
> > > > > > > > > > > > > > The __prepare_userptr() function made the incorrect assumption that if the
> > > > > > > > > > > > > > same user pointer was used as the last one for which memory was acquired, then
> > > > > > > > > > > > > > there was no need to re-acquire the memory. This assumption was never properly
> > > > > > > > > > > > > > tested, and after doing that it became clear that this was in fact wrong.
> > > > > > > > > > > > > Could you explain in the commit message why the assumption is not
> > > > > > > > > > > > > correct ?
> > > > > > > > > > > > You can free the memory, then allocate it again and you can get the same pointer,
> > > > > > > > > > > > even though it is not necessarily using the same physical pages for the memory
> > > > > > > > > > > > that the kernel is still using for it.
> > > > > > > > > > > >
> > > > > > > > > > > > Worse, you can free the memory, then allocate only half the memory you need and
> > > > > > > > > > > > get back the same pointer. vb2 wouldn't notice this. And it seems to work (since
> > > > > > > > > > > > the original mapping still remains), but this can corrupt userspace memory
> > > > > > > > > > > > causing the application to crash. It's not quite clear to me how the memory can
> > > > > > > > > > > > get corrupted. I don't know enough of those low-level mm internals to understand
> > > > > > > > > > > > the sequence of events.
> > > > > > > > > > > >
> > > > > > > > > > > > I have test code for v4l2-compliance available if someone wants to test this.
> > > > > > > > > > > I'm interested, I would really like to know what happens in the mm
> > > > > > > > > > > subsystem in such case.
> > > > > > > > > > Here it is:
> > > > > > > > > >
> > > > > > > > > > diff --git a/utils/v4l2-compliance/v4l2-test-buffers.cpp b/utils/v4l2-compliance/v4l2-test-buffers.cpp
> > > > > > > > > > index be606e48..9abf41da 100644
> > > > > > > > > > --- a/utils/v4l2-compliance/v4l2-test-buffers.cpp
> > > > > > > > > > +++ b/utils/v4l2-compliance/v4l2-test-buffers.cpp
> > > > > > > > > > @@ -797,7 +797,7 @@ int testReadWrite(struct node *node)
> > > > > > > > > >       return 0;
> > > > > > > > > >   }
> > > > > > > > > >
> > > > > > > > > > -static int captureBufs(struct node *node, const cv4l_queue &q,
> > > > > > > > > > +static int captureBufs(struct node *node, cv4l_queue &q,
> > > > > > > > > >               const cv4l_queue &m2m_q, unsigned frame_count, int pollmode,
> > > > > > > > > >               unsigned &capture_count)
> > > > > > > > > >   {
> > > > > > > > > > @@ -962,6 +962,21 @@ static int captureBufs(struct node *node, const cv4l_queue &q,
> > > > > > > > > >                               buf.s_flags(V4L2_BUF_FLAG_REQUEST_FD);
> > > > > > > > > >                               buf.s_request_fd(buf_req_fds[req_idx]);
> > > > > > > > > >                       }
> > > > > > > > > > +                     if (v4l_type_is_capture(buf.g_type()) && q.g_memory() == V4L2_MEMORY_USERPTR) {
> > > > > > > > > > +                             printf("\nidx: %d", buf.g_index());
> > > > > > > > > > +                             for (unsigned p = 0; p < q.g_num_planes(); p++) {
> > > > > > > > > > +                                     printf(" old buf[%d]: %p ", p, buf.g_userptr(p));
> > > > > > > > > > +                                     fflush(stdout);
> > > > > > > > > > +                                     free(buf.g_userptr(p));
> > > > > > > > > > +                                     void *m = calloc(1, q.g_length(p)/2);
> > > > > > > > > > +
> > > > > > > > > > +                                     fail_on_test(m == NULL);
> > > > > > > > > > +                                     q.s_userptr(buf.g_index(), p, m);
> > > > > > > > > > +                                     printf("new buf[%d]: %p", p, m);
> > > > > > > > > > +                                     buf.s_userptr(m, p);
> > > > > > > > > > +                             }
> > > > > > > > > > +                             printf("\n");
> > > > > > > > > > +                     }
> > > > > > > > > >                       fail_on_test(buf.qbuf(node, q));
> > > > > > > > > >                       fail_on_test(buf.g_flags() & V4L2_BUF_FLAG_DONE);
> > > > > > > > > >                       if (buf.g_flags() & V4L2_BUF_FLAG_REQUEST_FD) {
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Load the vivid driver and just run 'v4l2-compliance -s10' and you'll see:
> > > > > > > > > >
> > > > > > > > > > ...
> > > > > > > > > > Streaming ioctls:
> > > > > > > > > >          test read/write: OK
> > > > > > > > > >          test blocking wait: OK
> > > > > > > > > >          test MMAP (no poll): OK
> > > > > > > > > >          test MMAP (select): OK
> > > > > > > > > >          test MMAP (epoll): OK
> > > > > > > > > >          Video Capture: Frame #000
> > > > > > > > > > idx: 0 old buf[0]: 0x7f71c6e7c010 new buf[0]: 0x7f71c6eb4010
> > > > > > > > > >          Video Capture: Frame #001
> > > > > > > > > > idx: 1 old buf[0]: 0x7f71c6e0b010 new buf[0]: 0x7f71c6e7b010
> > > > > > > > > >          Video Capture: Frame #002
> > > > > > > > > > idx: 0 old buf[0]: 0x7f71c6eb4010 free(): invalid pointer
> > > > > > > > > > Aborted
> > > > > > > > > To clarify: two full size buffers are allocated and queued (that happens in setupUserPtr()),
> > > > > > > > > then streaming starts and captureBufs is called which basically just calls dqbuf
> > > > > > > > > and qbuf.
> > > > > > > > >
> > > > > > > > > Tomasz pointed out that all the pointers in this log are actually different. That's
> > > > > > > > > correct, but here is a log where the old and new buf ptr are the same:
> > > > > > > > >
> > > > > > > > > Streaming ioctls:
> > > > > > > > >          test read/write: OK
> > > > > > > > >          test blocking wait: OK
> > > > > > > > >          test MMAP (no poll): OK
> > > > > > > > >          test MMAP (select): OK
> > > > > > > > >          test MMAP (epoll): OK
> > > > > > > > >          Video Capture: Frame #000
> > > > > > > > > idx: 0 old buf[0]: 0x7f1094e16010 new buf[0]: 0x7f1094e4e010
> > > > > > > > >          Video Capture: Frame #001
> > > > > > > > > idx: 1 old buf[0]: 0x7f1094da5010 new buf[0]: 0x7f1094e15010
> > > > > > > > >          Video Capture: Frame #002
> > > > > > > > > idx: 0 old buf[0]: 0x7f1094e4e010 new buf[0]: 0x7f1094e4e010
> > > > > > > > >          Video Capture: Frame #003
> > > > > > > > > idx: 1 old buf[0]: 0x7f1094e15010 free(): invalid pointer
> > > > > > > > > Aborted
> > > > > > > > >
> > > > > > > > > It's weird that the first log fails that way: if the pointers are different,
> > > > > > > > > then vb2 will call get_userptr and it should discover that the buffer isn't
> > > > > > > > > large enough, causing qbuf to fail. That doesn't seem to happen.
> > > > > > > > I think that the reason for this corruption is that the memory pool used
> > > > > > > > by glibc is now large enough for vb2 to think it can map the full length
> > > > > > > > of the user pointer into memory, even though only the first half is actually
> > > > > > > > from the buffer that's allocated. When you capture a frame you just overwrite
> > > > > > > > a random part of the application's memory pool, causing this invalid pointer.
> > > > > > > >
> > > > > > > > But that's a matter of garbage in, garbage out. So that's not the issue here.
> > > > > > > >
> > > > > > > > The real question is what happens when you free the old buffer, allocate a
> > > > > > > > new buffer, end up with the same userptr, but it's using one or more different
> > > > > > > > pages for its memory compared to the mapping that the kernel uses.
> > > > > > > >
> > > > > > > > I managed to reproduce this with v4l2-ctl:
> > > > > > > >
> > > > > > > > diff --git a/utils/v4l2-ctl/v4l2-ctl-streaming.cpp b/utils/v4l2-ctl/v4l2-ctl-streaming.cpp
> > > > > > > > index 28b2b3b9..8f2ed9b5 100644
> > > > > > > > --- a/utils/v4l2-ctl/v4l2-ctl-streaming.cpp
> > > > > > > > +++ b/utils/v4l2-ctl/v4l2-ctl-streaming.cpp
> > > > > > > > @@ -1422,6 +1422,24 @@ static int do_handle_cap(cv4l_fd &fd, cv4l_queue &q, FILE *fout, int *index,
> > > > > > > >                    * has the size that fits the old resolution and might not
> > > > > > > >                    * fit to the new one.
> > > > > > > >                    */
> > > > > > > > +         if (q.g_memory() == V4L2_MEMORY_USERPTR) {
> > > > > > > > +                 printf("\nidx: %d", buf.g_index());
> > > > > > > > +                 for (unsigned p = 0; p < q.g_num_planes(); p++) {
> > > > > > > > +                         unsigned *pb = (unsigned *)buf.g_userptr(p);
> > > > > > > > +                         printf(" old buf[%d]: %p first pixel: 0x%x", p, buf.g_userptr(p), *pb);
> > > > > > > > +                         fflush(stdout);
> > > > > > > > +                         free(buf.g_userptr(p));
> > > > > > > > +                         void *m = calloc(1, q.g_length(p));
> > > > > > > > +
> > > > > > > > +                         if (m == NULL)
> > > > > > > > +                                 return QUEUE_ERROR;
> > > > > > > > +                         q.s_userptr(buf.g_index(), p, m);
> > > > > > > > +                         if (m == buf.g_userptr(p))
> > > > > > > > +                                 printf(" identical new buf");
> > > > > > > > +                         buf.s_userptr(m, p);
> > > > > > > > +                 }
> > > > > > > > +                 printf("\n");
> > > > > > > > +         }
> > > > > > > >                   if (fd.qbuf(buf) && errno != EINVAL) {
> > > > > > > >                           fprintf(stderr, "%s: qbuf error\n", __func__);
> > > > > > > >                           return QUEUE_ERROR;
> > > > > > > >
> > > > > > > >
> > > > > > > > Load vivid, setup a pure white test pattern:
> > > > > > > >
> > > > > > > > v4l2-ctl -c test_pattern=6
> > > > > > > >
> > > > > > > > Now run v4l2-ctl --stream-user and you'll see:
> > > > > > > >
> > > > > > > > idx: 0 old buf[0]: 0x7f91551cb010 first pixel: 0x80ea80ea identical new buf
> > > > > > > > <
> > > > > > > > idx: 1 old buf[0]: 0x7f915515a010 first pixel: 0x80ea80ea identical new buf
> > > > > > > > <
> > > > > > > > idx: 2 old buf[0]: 0x7f91550e9010 first pixel: 0x80ea80ea identical new buf
> > > > > > > > <
> > > > > > > > idx: 3 old buf[0]: 0x7f9155078010 first pixel: 0x80ea80ea identical new buf
> > > > > > > > <
> > > > > > > > idx: 0 old buf[0]: 0x7f91551cb010 first pixel: 0x0 identical new buf
> > > > > > > > <
> > > > > > > > idx: 1 old buf[0]: 0x7f915515a010 first pixel: 0x0 identical new buf
> > > > > > > > < 5.00 fps
> > > > > > > >
> > > > > > > > idx: 2 old buf[0]: 0x7f91550e9010 first pixel: 0x0 identical new buf
> > > > > > > > <
> > > > > > > > idx: 3 old buf[0]: 0x7f9155078010 first pixel: 0x0 identical new buf
> > > > > > > >
> > > > > > > > The first four dequeued buffers are filled with data, after that the
> > > > > > > > returned buffer is empty because vivid is actually writing to different
> > > > > > > > memory pages.
> > > > > > > >
> > > > > > > > With this patch the first pixel is always non-zero.
> > > > > > >
> > > > > > > Good catch. The question is weather we treat that as undefined behavior
> > > > > > > and keep the optimization for 'good applications' or assume that every
> > > > > > > broken userspace code has to be properly handled.
> > > > > >
> > > > > > Given how long we've been saying that USERPTR should be replaced by
> > > > > > DMABUF, I would consider that any userspace code using USERPTR is broken
> > > > > > :-) One could however question whether we were effective at getting that
> > > > > > message across...
> > > > >
> > > > > Just a reminder that DMABuf is not a replacement for USERPTR. It only
> > > > > cover a subset in absence of an allocater for it. There is no clean way
> > > > > to allocate a DMAbuf. Notably, memfds (which could have filled the gap)
> > > > > are not DMABuf, even though they are they are similar to the buffers
> > > > > allocated by vivid or uvcvideo.
> > > >
> > > > You always have the option to use MMAP to allocate buffers on the V4L2
> > > > device. What prevents you from doing so and forces usage of USERPTR ?
> > >
> > > If you use MMAP on one v4l2 device, how do you import that into another
> > > v4l2 device ?
> >
> > You can simply export the MMAP buffers on the V4L2 device that has
> > allocated them, and use DMABUF on the importing device.
> >
> > > Now, let's say your source is not a v4l2 device, and uses virtual
> > > memory, how does DMABuf replaces such a use case if you want to avoid
> > > copies and you know your HW can support fast usage of these randomly
> > > allocated buffers ?
> >
> > For this use case you should allocate buffers on the sink, mmap them,
> > and use the mapped memory on the source side. I agree that not all
> > sources may support this mode of operation, but that's a design issue
> > with the source. If we had a dmabuf allocator your problem wouldn't be
> > solved, as the source would still need to be modified to use it.
>
> I don't think any of this reflection covers the surface of the
> restrictions that V4L2 combined queue/allocation impose on userspace.
> One of our very common scenarios requires to capture from one source,
> and zero-copy that toward multiple sink. One source could be USB driven
> (non v4l2), or network socket. Allocating from one random sink isn't
> really working, in fact it would lead to ebusy prior to the orphaning
> mechanism that was added recently. Since as long as one buffer of a
> device is still active, the driver (and the HW behind) cannot be used
> anymore.
>
> Over your N sinks, you maybe have zero-copy for some of them, even if
> it's foreign allocation, while others will just fallback to mmap/copy,
> and that's could be all right for a specific application. But as the
> source does not always have a "dmabuf", USERPTR remains the only option
> one could try. I don't know how memfd works, but maybe memfd should be
> a memory type in replacement to USERPTR ? I'm really not sure what the
> replacement, but I'm quite clear that there is no zero-copy replacement
> for it atm.

We have udmabuf now:
https://elixir.bootlin.com/linux/v5.2-rc7/source/drivers/dma-buf/udmabuf.c