Hi Kieran, On Wed, Aug 8, 2018 at 10:07 PM Kieran Bingham <kieran.bingham@xxxxxxxxxxxxxxxx> wrote: > > Hi All, > > On 08/08/18 13:45, Keiichi Watanabe wrote: > > Hi Laurent, Kieran, Tomasz, > > > > Thank you for reviews and suggestions. > > I want to do additional measurements for improving the performance. > > > > Let me clarify my understanding: > > Currently, if the platform doesn't support coherent-DMA (e.g. ARM), > > urb_buffer is allocated by usb_alloc_coherent with > > URB_NO_TRANSFER_DMA_MAP flag instead of using kmalloc. > > This is because we want to avoid frequent DMA mappings, which are > > generally expensive. > > However, memories allocated in this way are not cached. > > > > So, we wonder if using usb_alloc_coherent is really fast. > > In other words, we want to know which is better: > > "No DMA mapping/Uncached memory" v.s. "Frequent DMA mapping/Cached memory". > > > > For this purpose, I'm planning to measure performance on ARM > > Chromebooks in the following conditions: > > 1. Current implementation with Kieran's patches > > 2. 1. + my patch > > 3. Use kmalloc instead > > > > 1 and 2 are the same conditions I reported in the first mail on this thread. > > For condition 3, I only have to add "#define CONFIG_DMA_NONCOHERENT" > > at the beginning of uvc_video.c. > > I'd be interested in numbers/performances both with and without my async > if possible too. > > The async path can be easily disabled temporarily with the following > change: (perhaps this should be a module option?) > > diff --git a/drivers/media/usb/uvc/uvc_video.c > b/drivers/media/usb/uvc/uvc_video.c > index 8bb6e90f3483..f9fbdc9bfa4b 100644 > --- a/drivers/media/usb/uvc/uvc_video.c > +++ b/drivers/media/usb/uvc/uvc_video.c > @@ -1505,7 +1505,8 @@ static void uvc_video_complete(struct urb *urb) > } > > INIT_WORK(&uvc_urb->work, uvc_video_copy_data_work); > - queue_work(stream->async_wq, &uvc_urb->work); > +// queue_work(stream->async_wq, &uvc_urb->work); > + uvc_video_copy_data_work(&uvc_urb->work); > } > > /* > > > I do suspect that even with cached buffers, it's probably likely we > should still consider the async patches to move the memcopy out of > interrupt context. > It sounds better to check. I'll do tests with/without your async patches in both cached/uncached memory allocation conditions. Best regards, Keiichi > -- > Regards > > Kieran > > > > > > > > Does this plan sound reasonable? > > > > Best regards, > > Keiichi > > On Wed, Aug 8, 2018 at 5:42 PM Laurent Pinchart > > <laurent.pinchart@xxxxxxxxxxxxxxxx> wrote: > >> > >> Hi Tomasz, > >> > >> On Wednesday, 8 August 2018 07:08:59 EEST Tomasz Figa wrote: > >>> On Tue, Jul 31, 2018 at 1:00 AM Laurent Pinchart wrote: > >>>> On Wednesday, 27 June 2018 13:34:08 EEST Keiichi Watanabe wrote: > >>>>> On some platforms with non-coherent DMA (e.g. ARM), USB drivers use > >>>>> uncached memory allocation methods. In such situations, it sometimes > >>>>> takes a long time to access URB buffers. This can be a cause of video > >>>>> flickering problems if a resolution is high and a USB controller has > >>>>> a very tight time limit. (e.g. dwc2) To avoid this problem, we copy > >>>>> header data from (uncached) URB buffer into (cached) local buffer. > >>>>> > >>>>> This change should make the elapsed time of the interrupt handler > >>>>> shorter on platforms with non-coherent DMA. We measured the elapsed > >>>>> time of each callback of uvc_video_complete without/with this patch > >>>>> while capturing Full HD video in > >>>>> https://webrtc.github.io/samples/src/content/getusermedia/resolution/. > >>>>> I tested it on the top of Kieran Bingham's Asynchronous UVC series > >>>>> https://www.mail-archive.com/linux-media@xxxxxxxxxxxxxxx/msg128359.html. > >>>>> The test device was Jerry Chromebook (RK3288) with Logitech Brio 4K. > >>>>> I collected data for 5 seconds. (There were around 480 callbacks in > >>>>> this case.) The following result shows that this patch makes > >>>>> uvc_video_complete about 2x faster. > >>>>> > >>>>> | average | median | min | max | standard deviation > >>>>> w/o caching| 45319ns | 40250ns | 33834ns | 142625ns| 16611ns > >>>>> w/ caching| 20620ns | 19250ns | 12250ns | 56583ns | 6285ns > >>>>> > >>>>> In addition, we confirmed that this patch doesn't make it worse on > >>>>> coherent DMA architecture by performing the same measurements on a > >>>>> Broadwell Chromebox with the same camera. > >>>>> > >>>>> | average | median | min | max | standard deviation > >>>>> w/o caching| 21026ns | 21424ns | 12263ns | 23956ns | 1932ns > >>>>> w/ caching| 20728ns | 20398ns | 8922ns | 45120ns | 3368ns > >>>> > >>>> This is very interesting, and it seems related to https:// > >>>> patchwork.kernel.org/patch/10468937/. You might have seen that discussion > >>>> as you got CC'ed at some point. > >>>> > >>>> I wonder whether performances couldn't be further improved by allocating > >>>> the URB buffers cached, as that would speed up the memcpy() as well. Have > >>>> you tested that by any chance ? > >>> > >>> We haven't measure it, but the issue being solved here was indeed > >>> significantly reduced by using cached URB buffers, even without > >>> Kieran's async series. After we discovered the latter, we just > >>> backported it and decided to further tweak the last remaining bit, to > >>> avoid playing too much with the DMA API in code used in production on > >>> several different platforms (including both ARM and x86). > >>> > >>> If you think we could change the driver to use cached buffers instead > >>> (as the pwc driver mentioned in another thread), I wouldn't have > >>> anything against it obviously. > >> > >> I think there's a chance that performances could be further improved. > >> Furthermore, it would lean to simpler code as we wouldn't need to deal with > >> caching headers manually. I would however like to see numbers before making a > >> decision. > >> > >> -- > >> Regards, > >> > >> Laurent Pinchart > >> > >> > >> > > -- > Regards > -- > Kieran