On 2020/08/13 19:23: Richard Leitner <richard.leitner@xxxxxxxxxxx> wrote:
> Hi,
> we've found a race condition with the PCM on the i.MX6 which results in an
> -EIO for the SNDRV_PCM_IOCTL_READI_FRAMES ioctl after an -EPIPE (XRUN).
>
> A possible reproduction may look like the following reduced call graph during a
> PCM capture:
>
> us -> ioctl(SNDRV_PCM_IOCTL_READI_FRAMES)
>       - wait_for_avail()
>         - schedule_timeout()
>    -> snd_pcm_update_hw_ptr0()
>       - snd_pcm_update_state: EPIPE (XRUN)
>       - sdma_disable_channel_async() # gets scheduled away due to sleep
> us <- ioctl(SNDRV_PCM_IOCTL_READI_FRAMES) returns -EPIPE
> us -> ioctl(SNDRV_PCM_IOCTL_PREPARE) # as reaction to the EPIPE (XRUN)
> us -> ioctl(SNDRV_PCM_IOCTL_READI_FRAMES) # next try to capture frames
>       - sdma_prep_dma_cyclic()
>         - sdma_load_context() # not loaded as context_loaded is 1
>       - wait_for_avail()
>         - schedule_timeout()
> # now the sdma_channel_terminate_work() comes back and sets
> # context_loaded = false and frees in vchan_dma_desc_free_list().
> us <- ioctl returns -EIO (capture write error (DMA or IRQ trouble?))

It seems the write error is caused by context_loaded not being set to false
before the next transfer starts? If so, please try patch 03/04 of the patch
set below. In any case, could you post your failure log?
https://lkml.org/lkml/2020/8/11/111

>
> What we have found out, based on our understanding:
> The dmaengine documentation states that a dmaengine_terminate_async() must be
> followed by a dmaengine_synchronize().
> However, in the pcm_dmaengine.c, only dmaengine_terminate_async() is
> called (for performance reasons and because it might be called from an
> interrupt handler).
>
> In our tests, we saw that the user-space immediately calls
> ioctl(SNDRV_PCM_IOCTL_PREPARE) to handle the xrun
> (previous ioctl(SNDRV_PCM_IOCTL_READI_FRAMES) returns with -EPIPE).
> In our case (imx-sdma.c), the terminate really happens asynchronously with a
> worker thread which is not awaited/synchronized by the
> ioctl(SNDRV_PCM_IOCTL_PREPARE) call.
>
> Since the syscall immediately enters an atomic context
> (snd_pcm_stream_lock_irq()), we are not able to flush the work of the
> termination worker from within the DMA context. This leads to an
> unterminated DMA getting re-initialized and then terminated.
>
> On the i.MX6 platform the problem is (if I got it correctly) that the
> sdma_channel_terminate_work() called after the -EPIPE gets scheduled away
> (for the 1-2ms sleep [1]). During that time the userspace already sends in the
> ioctl(SNDRV_PCM_IOCTL_PREPARE) and
> ioctl(SNDRV_PCM_IOCTL_READI_FRAMES).
> As none of them are synchronized in any way to the terminate_worker, the
> vchan_dma_desc_free_list() [2] and "sdmac->context_loaded = false;" [3] are
> executed during the wait_for_avail() [4] of the
> ioctl(SNDRV_PCM_IOCTL_READI_FRAMES).
>
> To make sure we identified the problem correctly, we tested adding a
> "dmaengine_synchronize()" before the snd_pcm_prepare() in [5]. This fixed the
> race condition in all our tests. (Before, we were able to reproduce it in 100% of
> the test runs.)
>
> Based on our understanding, there are two different points to ensure the
> termination:
> Either ensure that the termination is finished within the previous
> SNDRV_PCM_IOCTL_READI_FRAMES call (inside the DMA context) or finish
> it in the SNDRV_PCM_IOCTL_PREPARE call (and all other applicable ioctls)
> before entering the atomic context (from the PCM context).
>
> We initially thought about implementing the first approach, basically splitting
> up the dma_device terminate_all operation into a sync
> (busy-wait) and an async one. This would align the operations with the
> DMAengine interface and would enable a sync termination variant from atomic
> contexts.
> However, we saw that the dma_free_attrs() function has a WARN_ON on irqs
> disabled, which would be the case for the sync variant.
>
> Side note: We found this issue on the current v5.4.y LTS branch, but it also
> affects v5.8.y.
>
> Any feedback or pointers how we may fix the problem are warmly welcome!
> If anything is unclear please just ask :-)
>
> regards;
> Richard Leitner
> Benjamin Bara
>
> [1] https://elixir.bootlin.com/linux/v5.8/source/drivers/dma/imx-sdma.c#L1066
> [2] https://elixir.bootlin.com/linux/v5.8/source/drivers/dma/imx-sdma.c#L1071
> [3] https://elixir.bootlin.com/linux/v5.8/source/drivers/dma/imx-sdma.c#L1072
> [4] https://elixir.bootlin.com/linux/v5.8/source/sound/core/pcm_lib.c#L1825
> [5] https://elixir.bootlin.com/linux/v5.8/source/sound/core/pcm_native.c#L3226
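
For illustration only, the ordering described in the quoted report can be
modelled in plain userspace C. This is a rough sketch under assumptions, not
driver code: struct toy_chan and all toy_* functions are hypothetical names
that only mimic the roles of dmaengine_terminate_async(),
dmaengine_synchronize(), sdma_channel_terminate_work() and the
context_loaded flag. The deferred worker is modelled as a pending flag that
is run explicitly, so the "late worker" case is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one DMA channel (not the real imx-sdma types). */
struct toy_chan {
	bool context_loaded; /* plays the role of sdmac->context_loaded */
	bool desc_valid;     /* true while the active descriptor exists */
	bool work_pending;   /* terminate work queued but not yet run */
};

/* Models the deferred terminate worker: frees descriptors and clears
 * context_loaded. Does nothing if no work is queued. */
static void toy_terminate_work(struct toy_chan *c)
{
	if (!c->work_pending)
		return;
	c->desc_valid = false;
	c->context_loaded = false;
	c->work_pending = false;
}

/* Models an async terminate: only queues the work, returns at once. */
static void toy_terminate_async(struct toy_chan *c)
{
	c->work_pending = true;
}

/* Models a synchronize step: the queued work has finished on return. */
static void toy_synchronize(struct toy_chan *c)
{
	toy_terminate_work(c);
}

/* Models prepare + start of the next transfer: a new descriptor is set up,
 * and the context is loaded only if it is not already marked as loaded. */
static void toy_start(struct toy_chan *c)
{
	c->desc_valid = true;
	if (!c->context_loaded)
		c->context_loaded = true;
}

/* Replays the XRUN recovery sequence. Returns true if the restarted
 * transfer is still intact after the (possibly late) worker has run. */
static bool toy_xrun_recovery(bool sync_before_prepare)
{
	struct toy_chan c = {
		.context_loaded = true,
		.desc_valid = true,
		.work_pending = false,
	};

	toy_terminate_async(&c);   /* XRUN detected, stop requested */
	if (sync_before_prepare)
		toy_synchronize(&c);   /* wait for the worker first */
	toy_start(&c);             /* PREPARE, then the next READI_FRAMES */
	toy_terminate_work(&c);    /* worker runs late; no-op if already done */

	return c.desc_valid && c.context_loaded;
}

/* Small self-check covering both orderings. */
static void toy_demo(void)
{
	assert(!toy_xrun_recovery(false)); /* late worker clobbers the restart */
	assert(toy_xrun_recovery(true));   /* synchronizing first avoids it */
}
```

Without the synchronize step the late worker invalidates the freshly started
transfer, which corresponds to the -EIO case in the report. With it, the
worker has nothing left to do by the time the transfer is restarted. The
model says nothing about where such a synchronize call can legally be placed
in the real PCM code, which is the open question in this thread.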