Re: Poor performance on mmap reading arm64 on audio device

Michael Nazzareno Trimarchi <michael@xxxxxxxxxxxxxxxxxxxx> · Mon, 23 Nov 2020 14:44:52 +0100

Hi

On Mon, Nov 23, 2020 at 2:23 PM Takashi Iwai <tiwai@xxxxxxx> wrote:
>
> On Sat, 21 Nov 2020 10:40:04 +0100,
> Michael Nazzareno Trimarchi wrote:
> >
> > Hi all
> >
> > I'm trying to figure out how to increase performance on audio reading
> > using the mmap interface. As far as I understand, the allocation
> > comes from the core/memalloc.c ops, which allocate the memory for
> > DMA under driver/dma.
> > The reference platform I have is an imx8mm and the allocation in arm64 is:
> >
> > 0xffff800011ff5000-0xffff800012005000          64K PTE       RW NX SHD AF            UXN MEM/NORMAL-NC
> >
> > This is the region that is allocated for the DMA interface.
> >
> > Now, linear access on the multichannel interface performs badly,
> > and it is even worse if I read one channel at a time.
> > So it looks like it is better to copy the block using memcpy to a
> > cached area and then operate on single-channel samples. If what I'm
> > saying is correct, mmap_begin and mmap_commit
> > basically don't do anything at the cache level, so the page mapping
> > and the way it is used are always the same. Can the interface be modified to
> > allow caching the area during read and restoring it in the commit
> > phase?
>
> The current API of the mmap for the sound ring-buffer is designed to
> allow concurrent accesses at any time with minimal kernel-user
> context switching.  So the whole buffer is allocated as coherent and
> mmapped in one shot.  It's pretty efficient for architectures like x86,
> but has disadvantages on ARM, indeed.

Each platform and/or architecture can specialize the mmap and declare
that the DMA-consistent area is to be mapped as a non-cached one.

vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
return remap_pfn_range(vma, vma->vm_start,
                       vma->vm_end - vma->vm_start, vma->vm_page_prot);

I have done this for testing purposes. It gives an idea of the cost:

- reading multiple channels non-sequentially through the mmap
interface took around 12% of the CPU
- reading multiple channels after a memcpy took around 6%
- reading from a cached area took around 3%. I'm trying to figure out how
and when to invalidate the area
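
To make the second case concrete, here is a minimal userspace sketch of
"memcpy first, then work per channel". It assumes interleaved S16 samples,
and copy_block() and extract_channel() are hypothetical helper names of
mine, not ALSA API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy `frames` interleaved S16 frames out of the mmap area into a
 * normal (cached) buffer in one linear pass, so the non-cacheable
 * mapping is only touched sequentially.
 */
static void copy_block(int16_t *cached, const int16_t *mmap_area,
                       size_t frames, unsigned int channels)
{
    memcpy(cached, mmap_area, frames * channels * sizeof(int16_t));
}

/*
 * Pull a single channel out of the cached copy. The strided accesses
 * now hit cached memory instead of the non-cacheable mapping.
 */
static void extract_channel(int16_t *out, const int16_t *cached,
                            size_t frames, unsigned int channels,
                            unsigned int channel)
{
    assert(channel < channels);
    for (size_t i = 0; i < frames; i++)
        out[i] = cached[i * channels + channel];
}
```

In a real capture loop the mmap_area pointer would presumably be derived
from the areas returned by snd_pcm_mmap_begin(), and the copy would happen
before snd_pcm_mmap_commit(). That part is left out here.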

I have two use cases:
- write on the channels (no performance issue)
- read on channels

Before reading, I should only need to signal that the cached area is
out of sync with memory. I think that supporting the write use case
makes little sense here.

>
> The mmap_begin and mmap_commit are the concepts in the alsa-lib side
> for supporting the plugins better, and they don't represent kernel
> ABI.  So, this extension would be needed at first, and the memory
> allocation mechanism has to be changed as well.  Last but not least,

Are you sure about memory allocation, or just memory mapping?

> the concurrency has to be reconsidered if this approach is taken.
>

Yes, I know that is a big problem anyway. I don't have a good idea of how to solve it.

Michael

> That said, it's possible in theory, but practically no trivial task.
>
>
> thanks,
>
> Takashi

-- 
Michael Nazzareno Trimarchi
Amarula Solutions BV
COO Co-Founder
Cruquiuskade 47 Amsterdam 1018 AM NL
T. +31(0)851119172
M. +39(0)3479132170
[`as] https://www.amarulasolutions.com