Re: pcm|dmaengine|imx-sdma race condition on i.MX6

Lars-Peter Clausen <lars@xxxxxxxxxx> · Wed, 19 Aug 2020 13:08:29 +0200

On 8/17/20 9:28 AM, Benjamin Bara - SKIDATA wrote:
We think this is not an i.MX6-specific problem, but a problem of the DMAengine usage from the PCM.
In case of a XRUN, the DMA channel is never closed but first a SNDRV_PCM_TRIGGER_STOP next a
SNDRV_PCM_TRIGGER_START is triggered.
The SNDRV_PCM_TRIGGER_STOP simply executes a dmaengine_terminate_async() [1]
but does not await the termination by calling dmaengine_synchronize(),
which is required as stated by the docu [2].
Anyways, we are not able to fix it in the pcm_dmaengine layer either at the end of
SNDRV_PCM_TRIGGER_STOP (called from the DMA on complete interrupt handler)
or at the beginning of SNDRV_PCM_TRIGGER_START (called from a PCM ioctl),
since the dmaengine_synchronize() requires a non-atomic context.

I think this might be an sdma specific problem after all. 
dmaengine_terminate_async() will issue a request to stop the DMA. But it 
is still safe to issue the next transfer, even without calling 
dmaengine_synchronize(). The DMA should start the new transfer at its 
earliest convenience in that case.

dmaegine_synchronize() is so that the consumer has a guarantee that the 
DMA is finished using the resources (e.g. the memory buffers) associated 
with the DMA transfer so it can safely free them.

Based on my understanding, most of the DMA implementations don't even implement device_synchronize
and if they do, it might not really be necessary since the terminate_all operation is synchron.
There are a lot of buggy DMAengine drivers :) Pretty much all of them 
need device_synchronize() to get this right.

With the i.MX6, it looks a bit different:
Since [4], the terminate_all operation really schedules a worker which waits the required ~1ms and
then does the context freeing.
Now, the ioctl(SNDRV_PCM_IOCTL_PREPARE) and the following ioctl(SNDRV_PCM_IOCTL_READI_FRAMES),
which are called from US to handle/recover from a XRUN, are in a race with the terminate_worker.
If the terminate_worker finishes earlier, everything is fine.
Otherwise, the sdma_prep_dma_cyclic() is called, sets up everything and
as soon as it is scheduled out to wait for data, the terminate_worker is scheduled and kills it.
In this case, we wait in [5] until the timeout is reached and return with -EIO.

Based on our understanding, there exist two different fixing approaches:
We thought that the pcm_dmaengine should handle this by either synchronizing the DMA on a trigger or
terminating it synchronously.
However, as we are in an atomic context, we either have to give up the atomic context of the PCM
to finish the termination or we have to design a synchronous terminate variant which is callable
from an atomic context.

For the first option, which is potentially more performant, we have to leave the atomic PCM context
and we are not sure if we are allowed to.
For the second option, we would have to divide the dma_device terminate_all into an atomic sync and
an async one, which would align with the dmaengine API, giving it the option to ensure termination
in an atomic context.
Based on my understanding, most of them are synchronous anyways, for the currently async ones we
would have to implement busy waits.
However, with this approach, we reach the WARN_ON [6] inside of an atomic context,
indicating we might not do the right thing.

I don't know how feasible this is to implement in the SDMA dmaengine 
driver. But I think what is should do is to have some flag to indicate 
if a terminate is in progress. If a new transfer is issued while 
terminate is in progress the transfer should go on a list. Once 
terminate finishes it should check the list and start the transfer if 
there are any on the list.

- Lars