Re: Poor performance on mmap reading arm64 on audio device

Michael Nazzareno Trimarchi <michael@xxxxxxxxxxxxxxxxxxxx> · Mon, 23 Nov 2020 16:15:50 +0100

Hi

On Mon, Nov 23, 2020 at 3:28 PM Takashi Iwai <tiwai@xxxxxxx> wrote:
>
> On Mon, 23 Nov 2020 15:19:34 +0100,
> Michael Nazzareno Trimarchi wrote:
> >
> > Hi
> >
> > On Mon, Nov 23, 2020 at 2:54 PM Takashi Iwai <tiwai@xxxxxxx> wrote:
> > >
> > > On Mon, 23 Nov 2020 14:44:52 +0100,
> > > Michael Nazzareno Trimarchi wrote:
> > > >
> > > > Hi
> > > >
> > > > On Mon, Nov 23, 2020 at 2:23 PM Takashi Iwai <tiwai@xxxxxxx> wrote:
> > > > >
> > > > > On Sat, 21 Nov 2020 10:40:04 +0100,
> > > > > Michael Nazzareno Trimarchi wrote:
> > > > > >
> > > > > > Hi all
> > > > > >
> > > > > > > I'm trying to figure out how to increase performance on audio reading
> > > > > > > using the mmap interface. Right now, what I understand is that the
> > > > > > > allocation comes from the core/memalloc.c ops, which allocate the memory
> > > > > > > for DMA under driver/dma.
> > > > > > The reference platform I have is an imx8mm and the allocation in arm64 is:
> > > > > >
> > > > > > > 0xffff800011ff5000-0xffff800012005000          64K PTE       RW NX SHD AF            UXN MEM/NORMAL-NC
> > > > > >
> > > > > > > This is because the area is allocated for the DMA interface.
> > > > > >
> > > > > > > With linear access on the multichannel interface the performance is
> > > > > > > bad, and it is worse if I try to access one channel at a time on read.
> > > > > > > So it looks like it is better to copy the block using memcpy to a
> > > > > > > cached area and then operate on a single channel's samples. If I'm
> > > > > > > correct, mmap_begin and mmap_commit basically don't do anything at
> > > > > > > the cache level, so the page mapping and the way it is used are
> > > > > > > always the same. Can the interface be modified to allow caching the
> > > > > > > area during read and restoring it in the commit phase?
> > > > >
> > > > > The current API of the mmap for the sound ring-buffer is designed to
> > > > > allow concurrent accesses at any time in the minimalistic kernel-user
> > > > > context switching.  So the whole buffer is allocated as coherent and
> > > > > mmapped in a shot.  It's pretty efficient for architectures like x86,
> > > > > but has disadvantages on ARM, indeed.
> > > >
> > > > Each platform and/or architecture can specialize the mmap and declare
> > > > the DMA-consistent area to be mapped as a non-cached one
> > > >
> > > > vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
> > > >                 return remap_pfn_range(vma, vma->vm_start,
> > > >                                 vma->vm_end - vma->vm_start, vma->vm_page_prot);
> > > >
> > > > I have done it for testing purposes. This gives an idea:
> > > >
> > > > - read multi channel not sequentially took around 12% of the cpu with
> > > > mmap interface
> > > > - read multi channel use after a memcpy took around 6%
> > > > - read on a cached area took around 3%. I'm trying to figure out how
> > > > and when to invalidate the area
> > > >
> > > > I have two use cases:
> > > > - write on the channels (no performance issue)
> > > > - read on channels
> > > >
> > > > Before reading I should only say that the cached area is not in sync
> > > > with memory. I think that supporting write use cases
> > > > makes little sense here.
> > >
> > > It's a necessary use case, unfortunately.  The reason we ended up with
> > > one device per direction for the PCM many, many years ago was that
> > > some applications need to write the buffers for marking even for the
> > > read.  So it can't be read-only, and it's supposed to be coherent on
> > > both read and write -- as long as keeping the current API usage.
> > >
> >
> > If I understand correctly, the allocation of the DMA buffer depends on
> > the direction. Each device allocates one dma_buffer for the tx device
> > and one dma_buffer for the rx device
> >
> > @@ -105,10 +105,16 @@ static int imx_pcm_preallocate_dma_buffer(struct snd_pcm_substream *substream,
> >         size_t size = imx_pcm_hardware.buffer_bytes_max;
> >         int ret;
> >
> > -       ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV_IRAM,
> > -                                 dev,
> > -                                 size,
> > -                                 &substream->dma_buffer);
> > +       if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
> > +               ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV,
> > +                                         dev,
> > +                                         size,
> > +                                         &substream->dma_buffer);
> > +       else
> > +               ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV_IRAM,
> > +                                         dev,
> > +                                         size,
> > +                                         &substream->dma_buffer);
> >         if (ret)
> >                 return ret;
> >
> > Just a snippet from some of my testing. How the physical memory is used
> > by the kernel has nothing to do with how the memory is then mapped by
> > userspace to read from it. If I allocate it as consistent in
> > snd_dma_alloc_pages, then you can let the user remap the area as cached
> > in his own virtual mapping. What I'm trying to say is that behind the
> > scenes everything is consistent, but the user will get a cache line fill
> > on the first access and then he/she will read from the cache.
> > Maybe this assumption is totally wrong?
>
> Ah I see your point now.  I believe that this kind of mapping tweak
> could be done, but this doesn't satisfy the expectation of the mmap of
> the current sound API; e.g. dmix / dsnoop would fail.  So, if any,
> this should be an extension for some special usages.

Yes, I understand the dmix problem, but I think that the writer is still
a single thread that mixes the sources. And why is dsnoop a problem?
Sorry, I don't know how they are implemented.

>
> My original idea was to totally go away from the coherent allocation
> and mapping, but just let dynamically syncing like other drivers do
> (e.g. net devices), aligned with mmap_begin/mmap_commit in alsa-lib.
>
Agreed. I have been looking at how to make the transition simple, and this
is the reason that I was exploring the first option.
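
For reference, the memcpy variant measured earlier in this thread can be
sketched roughly as below. This is only a minimal userspace illustration,
assuming interleaved signed 16-bit samples. The function name
read_channel_via_bounce and its parameters are hypothetical and not part of
alsa-lib or the kernel. The idea is to touch the non-cached mapping once, in
a linear pass, and do the strided per-channel access on ordinary cached
memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not an ALSA API).
 *
 * mmap_area: start of the interleaved frames in the mmap'd ring buffer
 *            (assumed to be a non-cached mapping, so reads are slow)
 * bounce:    caller-provided buffer in normal cached memory, at least
 *            frames * channels samples large
 * out:       destination for the selected channel, at least frames samples
 *
 * Returns the number of frames extracted, or 0 on invalid arguments.
 */
size_t read_channel_via_bounce(const int16_t *mmap_area,
                               int16_t *bounce,
                               int16_t *out,
                               size_t frames,
                               unsigned int channels,
                               unsigned int channel)
{
    if (channels == 0 || channel >= channels)
        return 0;

    /* One linear pass over the slow mapping. */
    memcpy(bounce, mmap_area, frames * channels * sizeof(*bounce));

    /* Strided access now hits cached memory only. */
    for (size_t i = 0; i < frames; i++)
        out[i] = bounce[i * channels + channel];

    return frames;
}
```

In a capture loop, mmap_area would be the pointer derived from the areas
returned between mmap_begin and mmap_commit, and the commit would be done
right after the memcpy so the hardware can reuse the period.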

Michael

>
> Takashi

-- 
Michael Nazzareno Trimarchi
Amarula Solutions BV
COO Co-Founder
Cruquiuskade 47 Amsterdam 1018 AM NL
T. +31(0)851119172
M. +39(0)3479132170
[`as] https://www.amarulasolutions.com