Hi Lucas,

Thanks for the feedback.

>-----Original Message-----
>From: Lucas Stach [mailto:l.stach@xxxxxxxxxxxxxx]
>Sent: October 3, 2019 5:26 PM
>To: Kevin Groeneveld; dmaengine@xxxxxxxxxxxxxxx
>Subject: Re: kernel freeze regression relating to imx-sdma
>
>Hi Kevin,
>
>On Thursday, 03.10.2019 at 21:11 +0000, Kevin Groeneveld wrote:
>> We have recently come across an issue which can cause the linux kernel to
>> freeze on an iMX6 based system. After bisecting the issue I have found the
>> issue is related to the following imx-sdma commits:
>>
>> 64068853bc77786d1a28abb4087d6a3e93aedbe2 dmaengine: imx-sdma: use GFP_NOWAIT for dma descriptor allocations
>> b8603d2a5795c42f78998e70dc792336e0dc20c9 dmaengine: imx-sdma: implement channel termination via worker
>> ebb853b1bd5f659b92c71dc6a9de44cfc37c78c0 Revert "dmaengine: imx-sdma: alloclate bd memory from dma pool"
>> c06abca69218ac42fa58d1ba7a7b0d9bab5f1b18 Revert "dmaengine: imx-sdma: Use GFP_NOWAIT for dma allocations"
>>
>> If I revert these four commits the freeze does not happen. The issue is still
>> present up to at least v5.3.
>>
>> The problem seems to be related to overruns when doing an SPDIF capture,
>> particularly when there is not actually a valid SPDIF input to the iMX6. When
>> the issue occurs the following is (usually) printed on the console (using a v5.0
>> based kernel):
>>
>> [ 1251.944877] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> [ 1251.950823] rcu: (detected by 0, t=2102 jiffies, g=220425, q=0)
>> [ 1251.956866] rcu: All QSes seen, last rcu_sched kthread activity 2102 (95117-93015), jiffies_till_next_fqs=1, root ->qsmask 0x0
>> [ 1251.968266] arecord S 0 1916 1908 0x00000002
>> [ 1251.973816] [<8010dcec>] (unwind_backtrace) from [<8010a4c4>] (show_stack+0x10/0x14)
>> [ 1251.981596] [<8010a4c4>] (show_stack) from [<8016b314>] (rcu_check_callbacks+0x57c/0x5d8)
>> [ 1251.989806] [<8016b314>] (rcu_check_callbacks) from [<8016fe84>] (update_process_times+0x30/0x5c)
>> [ 1251.998709] [<8016fe84>] (update_process_times) from [<8017e960>] (tick_sched_timer+0x44/0x94)
>> [ 1252.007353] [<8017e960>] (tick_sched_timer) from [<80170a84>] (__hrtimer_run_queues+0xd8/0x160)
>> [ 1252.016081] [<80170a84>] (__hrtimer_run_queues) from [<8017103c>] (hrtimer_interrupt+0xdc/0x284)
>> [ 1252.024891] [<8017103c>] (hrtimer_interrupt) from [<8010d6c0>] (twd_handler+0x2c/0x38)
>> [ 1252.032839] [<8010d6c0>] (twd_handler) from [<80160b34>] (handle_percpu_devid_irq+0x70/0x11c)
>> [ 1252.041401] [<80160b34>] (handle_percpu_devid_irq) from [<8015c234>] (generic_handle_irq+0x18/0x28)
>> [ 1252.050480] [<8015c234>] (generic_handle_irq) from [<8015c814>] (__handle_domain_irq+0xa0/0xb4)
>> [ 1252.059211] [<8015c814>] (__handle_domain_irq) from [<8039098c>] (gic_handle_irq+0x58/0x90)
>> [ 1252.067590] [<8039098c>] (gic_handle_irq) from [<80101a0c>] (__irq_svc+0x6c/0x90)
>> [ 1252.075085] Exception stack(0x848afe30 to 0x848afe78)
>> [ 1252.080155] fe20: 8decf200 00000000 00000000 80b84ac8
>> [ 1252.088354] fe40: 00000000 8decf200 892f7800 00001589 892f7800 00000000 00000000 848ae000
>> [ 1252.096549] fe60: 8d8cf000 848afe80 805721e8 8056aae4 60000013 ffffffff
>> [ 1252.103200] [<80101a0c>] (__irq_svc) from [<8056aae4>] (__snd_pcm_stream_unlock_mode+0xc0/0xd8)
>> [ 1252.111925] [<8056aae4>] (__snd_pcm_stream_unlock_mode) from [<00000000>] ( (null))
>> [ 1252.119695] rcu: rcu_sched kthread starved for 2102 jiffies! g220425 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
>> [ 1252.129877] rcu: RCU grace-period kthread stack dump:
>> [ 1252.134938] rcu_sched R running task 0 10 2 0x00000000
>> [ 1252.142034] [<806eee5c>] (__schedule) from [<806eef80>] (schedule+0x90/0xa0)
>> [ 1252.149116] [<806eef80>] (schedule) from [<806f2244>] (schedule_timeout+0x1f0/0x238)
>> [ 1252.156891] [<806f2244>] (schedule_timeout) from [<8016940c>] (rcu_gp_kthread+0x550/0x8f0)
>> [ 1252.165183] [<8016940c>] (rcu_gp_kthread) from [<8013a358>] (kthread+0x110/0x128)
>> [ 1252.172693] [<8013a358>] (kthread) from [<801010e8>] (ret_from_fork+0x14/0x2c)
>> [ 1252.179925] Exception stack(0x8f877fb0 to 0x8f877ff8)
>> [ 1252.184991] 7fa0: 00000000 00000000 00000000 00000000
>> [ 1252.193187] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> [ 1252.201380] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>
>> I suspect the issue may not actually be a bug with imx-sdma but a
>> race condition elsewhere that is exposed by the four mentioned
>> commits.
>>
>> I did notice that when an SPDIF overrun occurs
>> dmaengine_terminate_async is called but dmaengine_synchronize is
>> not. Should these calls always be paired? Or is it okay to call
>> dmaengine_terminate_async without dmaengine_synchronize in some
>> cases?
>
>Yes, it seems this uncovered a bug in the SPDIF driver. Before those
>commits the dmaengine_terminate_async in the SDMA driver wasn't really
>async, now it is. The API requires that you call dmaengine_synchronize
>eventually after the terminate_async before starting any new operations
>on the dmaengine. It doesn't necessarily need to be in the same
>codepath, but the async terminate always needs to be paired up with
>a synchronize.

I don't think the bug is limited to SPDIF. The calls to
dmaengine_terminate_async are in sound/core/pcm_dmaengine.c. When the
ALSA device is closed it does call dmaengine_synchronize, but when
there is an overrun it does not. Commit
bc0e7345168c0f7483d2d1da86285d89136417cd (by Lars-Peter Clausen) added
the dmaengine_synchronize call for the ALSA device close path; it
seems the overrun case was overlooked. I think I should copy the ALSA
list on this as well.
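Just to make sure I understand the pairing you describe, below is a
minimal sketch of what I think the overrun path would need to look
like. Only the dmaengine_* calls are the real API; the helper names
and their placement are hypothetical, not the actual pcm_dmaengine.c
code.

#include <linux/dmaengine.h>

/*
 * Overrun/XRUN handling: this may run in atomic context, so only
 * request termination here. dmaengine_terminate_async() does not wait
 * for the transfer or its callbacks to finish.
 */
static void example_stop_on_xrun(struct dma_chan *chan)
{
        dmaengine_terminate_async(chan);
}

/*
 * Before the channel is reused or its resources are freed, wait (in
 * sleepable context) for the terminated transfer and any in-flight
 * callbacks to actually complete.
 */
static void example_prepare_restart(struct dma_chan *chan)
{
        dmaengine_synchronize(chan);

        /* ...now it should be safe to prepare and submit new descriptors... */
}

If that is the intended usage, then I guess the missing piece is a
dmaengine_synchronize somewhere between the overrun terminate and the
next time the stream is prepared or restarted.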
I have also found another manifestation of the imx-sdma change which
may be useful for others trying to debug this. If I run the following
command:

arecord -Dhw:4,0 -fS24_LE -c2 -r44100 --buffer-size 100 > /dev/null

With kernel v4.20-rc5 I get an endless stream of overrun messages.
With kernel v4.20-rc6 (which includes the four imx-sdma commits
mentioned above) I get a few overrun messages, but then arecord exits
with:

arecord: pcm_read:2096: read error: Input/output error

Best regards,
Kevin