Hi All, On 08/08/18 13:45, Keiichi Watanabe wrote: > Hi Laurent, Kieran, Tomasz, > > Thank you for reviews and suggestions. > I want to do additional measurements for improving the performance. > > Let me clarify my understanding: > Currently, if the platform doesn't support coherent-DMA (e.g. ARM), > urb_buffer is allocated by usb_alloc_coherent with > URB_NO_TRANSFER_DMA_MAP flag instead of using kmalloc. > This is because we want to avoid frequent DMA mappings, which are > generally expensive. > However, memories allocated in this way are not cached. > > So, we wonder if using usb_alloc_coherent is really fast. > In other words, we want to know which is better: > "No DMA mapping/Uncached memory" v.s. "Frequent DMA mapping/Cached memory". > > For this purpose, I'm planning to measure performance on ARM > Chromebooks in the following conditions: > 1. Current implementation with Kieran's patches > 2. 1. + my patch > 3. Use kmalloc instead > > 1 and 2 are the same conditions I reported in the first mail on this thread. > For condition 3, I only have to add "#define CONFIG_DMA_NONCOHERENT" > at the beginning of uvc_video.c. I'd be interested in numbers/performances both with and without my async if possible too. The async path can be easily disabled temporarily with the following change: (perhaps this should be a module option?) diff --git a/drivers/media/usb/uvc/uvc_video.c b/drivers/media/usb/uvc/uvc_video.c index 8bb6e90f3483..f9fbdc9bfa4b 100644 --- a/drivers/media/usb/uvc/uvc_video.c +++ b/drivers/media/usb/uvc/uvc_video.c @@ -1505,7 +1505,8 @@ static void uvc_video_complete(struct urb *urb) } INIT_WORK(&uvc_urb->work, uvc_video_copy_data_work); - queue_work(stream->async_wq, &uvc_urb->work); +// queue_work(stream->async_wq, &uvc_urb->work); + uvc_video_copy_data_work(&uvc_urb->work); } /* I do suspect that even with cached buffers, it's probably likely we should still consider the async patches to move the memcopy out of interrupt context. -- Regards Kieran > > Does this plan sound reasonable? > > Best regards, > Keiichi > On Wed, Aug 8, 2018 at 5:42 PM Laurent Pinchart > <laurent.pinchart@xxxxxxxxxxxxxxxx> wrote: >> >> Hi Tomasz, >> >> On Wednesday, 8 August 2018 07:08:59 EEST Tomasz Figa wrote: >>> On Tue, Jul 31, 2018 at 1:00 AM Laurent Pinchart wrote: >>>> On Wednesday, 27 June 2018 13:34:08 EEST Keiichi Watanabe wrote: >>>>> On some platforms with non-coherent DMA (e.g. ARM), USB drivers use >>>>> uncached memory allocation methods. In such situations, it sometimes >>>>> takes a long time to access URB buffers. This can be a cause of video >>>>> flickering problems if a resolution is high and a USB controller has >>>>> a very tight time limit. (e.g. dwc2) To avoid this problem, we copy >>>>> header data from (uncached) URB buffer into (cached) local buffer. >>>>> >>>>> This change should make the elapsed time of the interrupt handler >>>>> shorter on platforms with non-coherent DMA. We measured the elapsed >>>>> time of each callback of uvc_video_complete without/with this patch >>>>> while capturing Full HD video in >>>>> https://webrtc.github.io/samples/src/content/getusermedia/resolution/. >>>>> I tested it on the top of Kieran Bingham's Asynchronous UVC series >>>>> https://www.mail-archive.com/linux-media@xxxxxxxxxxxxxxx/msg128359.html. >>>>> The test device was Jerry Chromebook (RK3288) with Logitech Brio 4K. >>>>> I collected data for 5 seconds. (There were around 480 callbacks in >>>>> this case.) The following result shows that this patch makes >>>>> uvc_video_complete about 2x faster. >>>>> >>>>> | average | median | min | max | standard deviation >>>>> w/o caching| 45319ns | 40250ns | 33834ns | 142625ns| 16611ns >>>>> w/ caching| 20620ns | 19250ns | 12250ns | 56583ns | 6285ns >>>>> >>>>> In addition, we confirmed that this patch doesn't make it worse on >>>>> coherent DMA architecture by performing the same measurements on a >>>>> Broadwell Chromebox with the same camera. >>>>> >>>>> | average | median | min | max | standard deviation >>>>> w/o caching| 21026ns | 21424ns | 12263ns | 23956ns | 1932ns >>>>> w/ caching| 20728ns | 20398ns | 8922ns | 45120ns | 3368ns >>>> >>>> This is very interesting, and it seems related to https:// >>>> patchwork.kernel.org/patch/10468937/. You might have seen that discussion >>>> as you got CC'ed at some point. >>>> >>>> I wonder whether performances couldn't be further improved by allocating >>>> the URB buffers cached, as that would speed up the memcpy() as well. Have >>>> you tested that by any chance ? >>> >>> We haven't measure it, but the issue being solved here was indeed >>> significantly reduced by using cached URB buffers, even without >>> Kieran's async series. After we discovered the latter, we just >>> backported it and decided to further tweak the last remaining bit, to >>> avoid playing too much with the DMA API in code used in production on >>> several different platforms (including both ARM and x86). >>> >>> If you think we could change the driver to use cached buffers instead >>> (as the pwc driver mentioned in another thread), I wouldn't have >>> anything against it obviously. >> >> I think there's a chance that performances could be further improved. >> Furthermore, it would lean to simpler code as we wouldn't need to deal with >> caching headers manually. I would however like to see numbers before making a >> decision. >> >> -- >> Regards, >> >> Laurent Pinchart >> >> >> -- Regards -- Kieran