Hi

On Mon, Nov 23, 2020 at 3:28 PM Takashi Iwai <tiwai@xxxxxxx> wrote:
>
> On Mon, 23 Nov 2020 15:19:34 +0100,
> Michael Nazzareno Trimarchi wrote:
> >
> > Hi
> >
> > On Mon, Nov 23, 2020 at 2:54 PM Takashi Iwai <tiwai@xxxxxxx> wrote:
> > >
> > > On Mon, 23 Nov 2020 14:44:52 +0100,
> > > Michael Nazzareno Trimarchi wrote:
> > > >
> > > > Hi
> > > >
> > > > On Mon, Nov 23, 2020 at 2:23 PM Takashi Iwai <tiwai@xxxxxxx> wrote:
> > > > >
> > > > > On Sat, 21 Nov 2020 10:40:04 +0100,
> > > > > Michael Nazzareno Trimarchi wrote:
> > > > > >
> > > > > > Hi all
> > > > > >
> > > > > > I'm trying to figure out how to increase the performance of audio
> > > > > > reads through the mmap interface. As far as I understand, the
> > > > > > allocation comes from the core/memalloc.c ops, which allocate the
> > > > > > memory for DMA under drivers/dma.
> > > > > > The reference platform I have is an imx8mm, and the allocation on
> > > > > > arm64 is:
> > > > > >
> > > > > > 0xffff800011ff5000-0xffff800012005000   64K PTE RW NX SHD
> > > > > > AF UXN MEM/NORMAL-NC
> > > > > >
> > > > > > This is the region that is allocated for the DMA interface.
> > > > > >
> > > > > > With linear access on the multichannel interface the performance
> > > > > > is bad, and it is even worse if I try to access one channel at a
> > > > > > time on read.
> > > > > > So it looks like it is better to copy the block with memcpy into a
> > > > > > cached area and then operate on the single-channel samples. If
> > > > > > what I'm saying is correct, mmap_begin and mmap_commit basically
> > > > > > do nothing at the cache level, so the page mapping and the way it
> > > > > > is used are always the same. Can the interface be modified to
> > > > > > allow caching the area during read and restoring it in the commit
> > > > > > phase?
> > > > >
> > > > > The current API of the mmap for the sound ring-buffer is designed to
> > > > > allow concurrent accesses at any time with minimal kernel-user
> > > > > context switching.
> > > > > So the whole buffer is allocated as coherent and
> > > > > mmapped in one shot. It's pretty efficient on architectures like
> > > > > x86, but it has disadvantages on ARM, indeed.
> > > >
> > > > Each platform and/or architecture can specialize the mmap and declare
> > > > how the DMA-coherent area is mapped (normally as non-cached). I have
> > > > done it for testing purposes, mapping it as cached:
> > > >
> > > > vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
> > > > return remap_pfn_range(vma, vma->vm_start,
> > > >                        vma->vm_end - vma->vm_start, vma->vm_page_prot);
> > > >
> > > > This gives an idea:
> > > >
> > > > - reading multiple channels non-sequentially took around 12% of the
> > > >   CPU with the mmap interface
> > > > - reading multiple channels after a memcpy took around 6%
> > > > - reading from a cached area took around 3%; I'm trying to figure out
> > > >   how and when to invalidate the area
> > > >
> > > > I have two use cases:
> > > > - write on the channels (no performance issue)
> > > > - read on the channels
> > > >
> > > > Before reading I should only note that the cached area is not in sync
> > > > with memory. I think that supporting the write use case makes little
> > > > sense here.
> > >
> > > It's a necessary use case, unfortunately. The reason we ended up with
> > > one device per direction for the PCM many, many years ago was that
> > > some applications need to write the buffers for marking, even for
> > > reads. So it can't be read-only, and it's supposed to be coherent on
> > > both read and write -- as long as the current API usage is kept.
> >
> > If I understand correctly, the allocation of the DMA buffer depends on
> > the direction.
> > Each device allocates one DMA buffer for the TX device and one DMA
> > buffer for the RX device:
> >
> > @@ -105,10 +105,16 @@ static int imx_pcm_preallocate_dma_buffer(struct
> > snd_pcm_substream *substream,
> >         size_t size = imx_pcm_hardware.buffer_bytes_max;
> >         int ret;
> >
> > -       ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV_IRAM,
> > -                                 dev,
> > -                                 size,
> > -                                 &substream->dma_buffer);
> > +       if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
> > +               ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV,
> > +                                         dev,
> > +                                         size,
> > +                                         &substream->dma_buffer);
> > +       else
> > +               ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV_IRAM,
> > +                                         dev,
> > +                                         size,
> > +                                         &substream->dma_buffer);
> >         if (ret)
> >                 return ret;
> >
> > Just a snippet from me, from some of my testing. How the physical
> > memory is used by the kernel has nothing to do with how the memory is
> > then mapped by userspace to read from it. If I allocate it coherent in
> > snd_dma_alloc_pages, then you can let the user remap the area as
> > cached in their own virtual mapping.
> > What I'm trying to say is that behind the scenes everything is
> > coherent, but the user will get a cache-line fill on the first access
> > and then read from the cache.
> > Or is this assumption totally wrong?
>
> Ah, I see your point now. I believe that this kind of mapping tweak
> could be done, but it doesn't satisfy the expectations of the mmap in
> the current sound API; e.g. dmix / dsnoop would fail. So, if anything,
> this should be an extension for some special usages.

Yes, I understand the dmix problem, but I think the writer there is
still a single thread that mixes the sources; and why is dsnoop a
problem? Sorry, I don't know how they are implemented.

> My original idea was to move away from the coherent allocation and
> mapping entirely, and instead sync dynamically like other drivers do
> (e.g. net devices), aligned with mmap_begin/mmap_commit in alsa-lib.
>

Agreed. I have been looking at how to make the transition simple, and
that is the reason I was exploring the first approach.

Michael

>
> Takashi

--
Michael Nazzareno Trimarchi
Amarula Solutions BV
COO Co-Founder
Cruquiuskade 47 Amsterdam 1018 AM NL
T. +31(0)851119172
M. +39(0)3479132170

[`as] https://www.amarulasolutions.com