On Tue, Jun 11, 2019 at 5:09 AM Robin Gong <yibin.gong@xxxxxxx> wrote:
>
> Sven, no any dependency from sdma driver view. The only difference between directly loading firmware
> from kernel and rootfs is the former spend more time during kernel boot and such timing may cause
> the crash. The issue is not 100% in my side, about 20% possibility, which looks like 'timing issue' . Another
> interesting thing is that every time the crash stop at somewhere drm, and After I disable ipu and display
> which use drm in i.mx6q.dtsi, the issue is gone on my i.mx6q-sabreauto board.
> Could you have a try with below patch as mine? If the issue is gone on your side, we could involve drm guys to
> look into it.

When I apply your patch to ipu and display, the crash still happens on
my device. But when I disable NFSv4 network filesystem in defconfig, the
crash disappears.

Yet on linux-next, the crash is there again, even if I disable the IPU
or NFSv4.

My guess: we are chasing ghosts, the crashes are purely timing related.
Things like disabling the IPU or NFSv4 change boot timing, and this
changes the crash.

Experiment: If I put msleep(1000) right before the sdma_load_script()
call, then the crash never happens. And if I comment out the call to
sdma_run_channel0() in sdma_load_script(), then the crash also does not
happen.

This suggests that the crash is related to the exact timing when
sdma_run_channel0() is called. If it is called too early, this results
in an 'interrupt storm' on the sdma interrupt handler: it gets called
millions of times in a very short amount of time.

By adding debug prints, I noticed that the sdma core calls back
sdma_alloc_chan_resources(), later during the boot, when a spi bus is
created.

Experiment: I paused firmware upload until the first time
sdma_alloc_chan_resources() is called by the core. I used a
struct completion to accomplish this.
Result: the crash never happens again.

All this suggests very strongly that sdma_run_channel0() is called
"too early" by the driver. I don't know enough about imx-sdma to tell
what is missing during the early call.

Here is the patch to delay firmware load until the first
sdma_alloc_chan_resources() has completed:

diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c
index 99d9f431ae2c..ddeded5c3337 100644
--- a/drivers/dma/imx-sdma.c
+++ b/drivers/dma/imx-sdma.c
@@ -33,6 +33,7 @@
 #include <linux/of_device.h>
 #include <linux/of_dma.h>
 #include <linux/workqueue.h>
+#include <linux/completion.h>
 
 #include <asm/irq.h>
 #include <linux/platform_data/dma-imx-sdma.h>
@@ -444,6 +445,7 @@ struct sdma_engine {
 	struct sdma_buffer_descriptor *bd0;
 	/* clock ratio for AHB:SDMA core. 1:1 is 1, 2:1 is 0*/
 	bool clk_ratio;
+	struct completion chan_resources_alloced;
 };
 
 static int sdma_config_write(struct dma_chan *chan,
@@ -1258,6 +1260,7 @@ static void sdma_desc_free(struct virt_dma_desc *vd)
 static int sdma_alloc_chan_resources(struct dma_chan *chan)
 {
 	struct sdma_channel *sdmac = to_sdma_chan(chan);
+	struct sdma_engine *sdma = sdmac->sdma;
 	struct imx_dma_data *data = chan->private;
 	struct imx_dma_data mem_data;
 	int prio, ret;
@@ -1310,6 +1313,7 @@ static int sdma_alloc_chan_resources(struct dma_chan *chan)
 	if (ret)
 		goto disable_clk_ahb;
 
+	complete(&sdma->chan_resources_alloced);
 	return 0;
 
 disable_clk_ahb:
@@ -1724,6 +1728,7 @@ static void sdma_load_firmware(const struct firmware *fw, void *context)
 		/* In this case we just use the ROM firmware. */
 		return;
 	}
+	wait_for_completion(&sdma->chan_resources_alloced);
 
 	if (fw->size < sizeof(*header))
 		goto err_firmware;
@@ -2012,6 +2017,7 @@ static int sdma_probe(struct platform_device *pdev)
 		return -ENOMEM;
 
 	spin_lock_init(&sdma->channel_0_lock);
+	init_completion(&sdma->chan_resources_alloced);
 
 	sdma->dev = &pdev->dev;
 	sdma->drvdata = drvdata;