Re: [RFC PATCH v1] media: uvcvideo: Cache URB header data before processing

Laurent Pinchart <laurent.pinchart@xxxxxxxxxxxxxxxx> · Wed, 08 Aug 2018 11:42:44 +0300

Hi Tomasz,

On Wednesday, 8 August 2018 07:08:59 EEST Tomasz Figa wrote:
> On Tue, Jul 31, 2018 at 1:00 AM Laurent Pinchart wrote:
> > On Wednesday, 27 June 2018 13:34:08 EEST Keiichi Watanabe wrote:
> >> On some platforms with non-coherent DMA (e.g. ARM), USB drivers use
> >> uncached memory allocation methods. In such situations, it sometimes
> >> takes a long time to access URB buffers. This can cause video
> >> flickering if the resolution is high and the USB controller (e.g.
> >> dwc2) has a very tight time limit. To avoid this problem, we copy
> >> header data from the (uncached) URB buffer into a (cached) local buffer.
> >> 
> >> This change should make the elapsed time of the interrupt handler
> >> shorter on platforms with non-coherent DMA. We measured the elapsed
> >> time of each callback of uvc_video_complete without/with this patch
> >> while capturing Full HD video in
> >> https://webrtc.github.io/samples/src/content/getusermedia/resolution/.
> >> I tested it on top of Kieran Bingham's Asynchronous UVC series:
> >> https://www.mail-archive.com/linux-media@xxxxxxxxxxxxxxx/msg128359.html.
> >> The test device was a Jerry Chromebook (RK3288) with a Logitech Brio 4K.
> >> I collected data for 5 seconds. (There were around 480 callbacks in
> >> this case.) The following result shows that this patch makes
> >> uvc_video_complete about 2x faster.
> >> 
> >>            | average | median  | min     | max     | standard deviation
> >> w/o caching| 45319ns | 40250ns | 33834ns | 142625ns| 16611ns
> >> w/  caching| 20620ns | 19250ns | 12250ns | 56583ns | 6285ns
> >> 
> >> In addition, we confirmed that this patch doesn't make it worse on
> >> coherent DMA architecture by performing the same measurements on a
> >> Broadwell Chromebox with the same camera.
> >> 
> >>            | average | median  | min     | max     | standard deviation
> >> w/o caching| 21026ns | 21424ns | 12263ns | 23956ns | 1932ns
> >> w/  caching| 20728ns | 20398ns |  8922ns | 45120ns | 3368ns
> > 
> > > This is very interesting, and it seems related to
> > > https://patchwork.kernel.org/patch/10468937/. You might have seen that
> > > discussion as you got CC'ed at some point.
> > 
> > > I wonder whether performance couldn't be further improved by allocating
> > > the URB buffers cached, as that would speed up the memcpy() as well. Have
> > > you tested that by any chance?
> 
> We haven't measured it, but the issue being solved here was indeed
> significantly reduced by using cached URB buffers, even without
> Kieran's async series. After we discovered the latter, we just
> backported it and decided to further tweak the last remaining bit, to
> avoid playing too much with the DMA API in code used in production on
> several different platforms (including both ARM and x86).
> 
> If you think we could change the driver to use cached buffers instead
> (as the pwc driver mentioned in another thread), I wouldn't have
> anything against it obviously.

I think there's a chance that performance could be further improved.
Furthermore, it would lead to simpler code, as we wouldn't need to deal with
caching headers manually. I would however like to see numbers before making a
decision.
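
For reference, the manual header caching discussed above can be sketched in
plain C roughly as follows. This is a user-space illustration and not the
actual patch: the function names, the HDR_CACHE_SIZE bound and the use of an
ordinary array in place of an uncached URB buffer are all assumptions made
for the example. The only facts relied upon are the UVC payload header
layout (byte 0 is bHeaderLength, byte 1 is bmHeaderInfo) and the EOF bit
position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical size of the local copy; bHeaderLength is one byte,
 * so a header can never be longer than 255 bytes. */
#define HDR_CACHE_SIZE 256

/* End-of-frame bit in bmHeaderInfo (bit 1 in the UVC payload header). */
#define STREAM_EOF (1u << 1)

/*
 * Copy the payload header from the (slow, possibly uncached) URB buffer
 * into a local buffer in a single pass, so that all later parsing reads
 * only the local copy. Returns the header length, or -1 if the header is
 * malformed or does not fit.
 */
static int cache_payload_header(const uint8_t *urb_data, size_t urb_len,
                                uint8_t *local, size_t local_size)
{
    size_t hdr_len;

    assert(urb_data != NULL && local != NULL);

    if (urb_len < 2)
        return -1;

    /* Read the length byte from the slow buffer exactly once. */
    hdr_len = urb_data[0];
    if (hdr_len < 2 || hdr_len > urb_len || hdr_len > local_size)
        return -1;

    memcpy(local, urb_data, hdr_len);
    return (int)hdr_len;
}

/* Example of later parsing that touches only the cached copy. */
static int payload_is_end_of_frame(const uint8_t *hdr)
{
    return (hdr[1] & STREAM_EOF) != 0;
}
```

With cached URB buffers this extra copy, and the bounds checks around it,
would presumably no longer be needed, which is the simplification I have in
mind.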

-- 
Regards,

Laurent Pinchart