Re: [PATCH] mmc: sh-mmcif: avoid Oops on spurious interrupts

Guennadi Liakhovetski <g.liakhovetski@xxxxxx> · Tue, 18 Sep 2012 10:56:46 +0200 (CEST)

On Tue, 18 Sep 2012, Tetsuyuki Kobayashi wrote:

> Hello Guennadi
> 
> (09/18/2012 05:02 PM), Tetsuyuki Kobayashi wrote:
> 
> >>>> (2012/08/22 15:49), Guennadi Liakhovetski wrote:
> >>>>> On some systems, e.g., kzm9g, MMCIF interfaces can produce spurious
> >>>>> interrupts without any active request. To prevent the Oops, that results
> >>>>> in such cases, don't dereference the mmc request pointer until we make
> >>>>> sure, that we are indeed processing such a request.
> >>>>>
> >>>>> Reported-by: Tetsuyuki Kobayashi <koba@xxxxxxxxxxx>
> >>>>> Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@xxxxxx>
> >>>>> ---
> >>>>>
> >>>>> Hello Kobayashi-san
> >>>>>
> >>>>> On Mon, 20 Aug 2012, Tetsuyuki Kobayashi wrote:
> >>>>>
> >>>>> ...
> >>>>>
> >>>>>> After applying this patch on kzm9g board, I got this error regarding
> >>>>>> eMMC.
> >>>>>> I think this is another problem.
> >>>>>>
> >>>>>>
> >>>>>> Unable to handle kernel NULL pointer dereference at virtual address
> >>>>>> 00000008
> >>>>>> pgd = c0004000
> >>>>>> [00000008] *pgd=00000000
> >>>>>> Internal error: Oops: 17 [#1] PREEMPT SMP ARM
> >>>>>> Modules linked in:
> >>>>>> CPU: 1    Not tainted  (3.6.0-rc2+ #103)
> >>>>>> PC is at sh_mmcif_irqt+0x20/0xb30
> >>>>>> LR is at irq_thread+0x94/0x16c
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>> My quick fix is below.
> >>>>>>
> >>>>>> diff --git a/drivers/mmc/host/sh_mmcif.c b/drivers/mmc/host/sh_mmcif.c
> >>>>>> index 5d81427..e587fbc 100644
> >>>>>> --- a/drivers/mmc/host/sh_mmcif.c
> >>>>>> +++ b/drivers/mmc/host/sh_mmcif.c
> >>>>>> @@ -1104,7 +1104,15 @@ static irqreturn_t sh_mmcif_irqt(int irq, void
> >>>>>> *dev_id)
> >>>>>>   {
> >>>>>>          struct sh_mmcif_host *host = dev_id;
> >>>>>>          struct mmc_request *mrq = host->mrq;
> >>>>>> -       struct mmc_data *data = mrq->data;
> >>>>>> +       /*struct mmc_data *data = mrq->data; -- this cause null
> >>>>>> pointer access*/
> >>>>>> +       struct mmc_data *data;
> >>>>>> +
> >>>>>> +       /* quick fix by koba */
> >>>>>> +       if (mrq == NULL) {
> >>>>>> +               printk("sh_mmcif_irqt: mrq == NULL:
> >>>>>> host->wait_for=%d\n", host->wait_for);
> >>>>>> +       } else {
> >>>>>> +               data = mrq->data;
> >>>>>> +       }
> >>>>>>
> >>>>>>          cancel_delayed_work_sync(&host->timeout_work);
> >>>>>>
> >>>>>>
> >>>>>> With this patch, there is no null pointer accesses and got this log.
> >>>>>>
> >>>>>> sh_mmcif_irqt: mrq == NULL: host->wait_for=0
> >>>>>> sh_mmcif_irqt: mrq == NULL: host->wait_for=0
> >>>>>>    ...
> >>>>>>
> >>>>>> host->wait_for is 0. it is MMCIF_WAIT_FOR_REQUEST.
> >>>>>> There is code such like:
> >>>>>>
> >>>>>>         host->wait_for = MMCIF_WAIT_FOR_REQUEST;
> >>>>>>         host->mrq = NULL;
> >>>>>>
> >>>>>> So, at the top of sh_mmcif_irqt, if host->wait_for ==
> >>>>>> MMCIF_WAIT_FOR_REQUEST,
> >>>>>> host->mrq = NULL.
> >>>>>> It is too earlier to access mrq->data before checking host->mrq. it may
> >>>>>> cause null pointer access.
> >>>>>>
> >>>>>> Goda-san, could you check this and refine the code of sh_mmcif_irqt?
> >>>>>
> >>>>> Thanks for your report and a fix. Could you please double-check, whether
> >>>>> the below patch also fixes your problem? Since such spurious interrupts
> >>>>> are possible I would commit a check like this one, but in the longer run
> >>>>> we want to identify and eliminate them, if possible. But since so far
> >>>>> these interrupts only happen on 1 board model and also not on all units
> >>>>> and not upon each boot, this could be a bit tricky.
> >>>>>
> >>>>> One more question - is this only needed for 3.7 or also for 3.6 / stable?
> >>>>>
> >>>>> Thanks
> >>>>> Guennadi
> >>>>>
> >>>>>   drivers/mmc/host/sh_mmcif.c |    4 ++--
> >>>>>   1 files changed, 2 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/mmc/host/sh_mmcif.c b/drivers/mmc/host/sh_mmcif.c
> >>>>> index 5d81427..82bf921 100644
> >>>>> --- a/drivers/mmc/host/sh_mmcif.c
> >>>>> +++ b/drivers/mmc/host/sh_mmcif.c
> >>>>> @@ -1104,7 +1104,6 @@ static irqreturn_t sh_mmcif_irqt(int irq, void
> >>>>> *dev_id)
> >>>>>   {
> >>>>>       struct sh_mmcif_host *host = dev_id;
> >>>>>       struct mmc_request *mrq = host->mrq;
> >>>>> -    struct mmc_data *data = mrq->data;
> >>>>>
> >>>>>       cancel_delayed_work_sync(&host->timeout_work);
> >>>>>
> >>>>> @@ -1152,13 +1151,14 @@ static irqreturn_t sh_mmcif_irqt(int irq, void
> >>>>> *dev_id)
> >>>>>       case MMCIF_WAIT_FOR_READ_END:
> >>>>>       case MMCIF_WAIT_FOR_WRITE_END:
> >>>>>           if (host->sd_error)
> >>>>> -            data->error = sh_mmcif_error_manage(host);
> >>>>> +            mrq->data->error = sh_mmcif_error_manage(host);
> >>>>>           break;
> >>>>>       default:
> >>>>>           BUG();
> >>>>>       }
> >>>>>
> >>>>>       if (host->wait_for != MMCIF_WAIT_FOR_STOP) {
> >>>>> +        struct mmc_data *data = mrq->data;
> >>>>>           if (!mrq->cmd->error && data && !data->error)
> >>>>>               data->bytes_xfered =
> >>>>>                   data->blocks * data->blksz;
> >>>>>
> >>>>
> >>>> I tried this patch. It seems better.
> >>>> But I think this still have potential race condition.
> >>>> I am afraid that one cpu enter sh_mmcif_irqt and other cpu write to
> >>>> host->wait_for for new request at the same time.
> >>>> How about add this code at the top of sh_mmcif_irqt or before returning
> >>>> IRQ_WAKE_THREAD in sh_mmcif_intr ?
> >>>>
> >>>>      if (host->state == STATE_IDLE)
> >>>>          return IRQ_HANDLED;
> >>>>
> >>>> I will rebase my test environment to v3.6-rc3 or later. Then I will
> >>>> send you my .config.
> >>>>
> >>> How is this?
> >>> I hope this fixed in v3.6.
> >>
> >> Sorry, I still haven't got round to looking at this. I think one thing 
> >> that could help is if you could try to find out what exactly those spurious 
> >> interrupts look like, i.e., what their interrupt status is. You could try 
> >> to print it like this:
> >>
> >> diff -u a/drivers/mmc/host/sh_mmcif.c b/drivers/mmc/host/sh_mmcif.c
> >> --- a/drivers/mmc/host/sh_mmcif.c
> >> +++ b/drivers/mmc/host/sh_mmcif.c
> >> @@ -1229,6 +1229,10 @@
> >>  		host->sd_error = true;
> >>  		dev_dbg(&host->pd->dev, "int err state = %08x\n", state);
> >>  	}
> >> +	if (host->state == STATE_IDLE) {
> >> +		dev_info(&host->pd->dev, "Spurious IRQ status 0x%x\n", state);
> >> +		return IRQ_HANDLED;
> >> +	}
> >>  	if (state & ~(INT_CMD12RBE | INT_CMD12CRE)) {
> >>  		if (!host->dma_active)
> >>  			return IRQ_WAKE_THREAD;
> >>
> >> Please, let me know, if it's not very easy for you ATM to perform the 
> >> test, I'll try to do it myself then.
> > 
> > OK. It is easy for me.
> > I got this log after mounting /dev/mmcblk2p1 (on eMMC) and executing
> > tar command to make file accesses.
> > This is based on v3.6-rc6.
> > 
> > [  149.968750] EXT4-fs (mmcblk2p1): warning: maximal mount count reached, running e2fsck is recommended
> > [  150.296875] EXT4-fs (mmcblk2p1): mounted filesystem with ordered data mode. Opts: (null)
> > [  221.539062] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  221.585937] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  222.039062] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  222.226562] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  222.382812] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  223.109375] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  223.406250] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  223.734375] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  223.750000] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  224.398437] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > [  230.289062] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
> > 
> I may have reported too early. I later got this log.
> 
> [  149.968750] EXT4-fs (mmcblk2p1): warning: maximal mount count reached, running e2fsck is recommended
> [  150.296875] EXT4-fs (mmcblk2p1): mounted filesystem with ordered data mode. Opts: (null)
> [  221.539062] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
...
> [  998.554687] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x3000000
...
> [ 1702.140625] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x3800000
> [ 1702.148437] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x3000000
...
> [ 1721.171875] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x3800000
> [ 1721.171875] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x3000000
...
> [ 1934.296875] sh_mmcif sh_mmcif.0: Spurious IRQ status 0x7400000
...

So, I see 3 IRQ types there: auto CMD12 only (0x3000000), auto CMD12 with 
read completed (0x7400000), and auto CMD12 with "Data Transmission 
Complete" (0x3800000).

So, all of them are related to automatic CMD12 execution. I think we 
should now try to find out how and when this auto CMD12 gets enabled, 
when it should be disabled again, and whether we have a problem with the 
logic there. Possibly, eMMC processing of CMD12 is different from that of 
normal MMC cards.
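
To make the decoding of those status values easier to reproduce, here is a 
minimal user-space sketch. The INT_* bit positions are quoted from memory 
of the CE_INT definitions in sh_mmcif.c and should be treated as 
assumptions to be checked against the driver source; decode_irq_status() 
and the irq_bits table are hypothetical helpers, not driver code.

```c
#include <stdint.h>
#include <string.h>

/* Assumed CE_INT bit positions; verify against sh_mmcif.c before relying on them. */
#define INT_CMD12DRE	(1u << 26)	/* auto CMD12: data read end */
#define INT_CMD12RBE	(1u << 25)	/* auto CMD12: response busy end */
#define INT_CMD12CRE	(1u << 24)	/* auto CMD12: response end */
#define INT_DTRANE	(1u << 23)	/* data transmission complete */
#define INT_BUFRE	(1u << 22)	/* buffer read end */

struct irq_bit {
	uint32_t mask;
	const char *name;
};

/* Hypothetical lookup table, highest bit first. */
static const struct irq_bit irq_bits[] = {
	{ INT_CMD12DRE, "CMD12DRE" },
	{ INT_CMD12RBE, "CMD12RBE" },
	{ INT_CMD12CRE, "CMD12CRE" },
	{ INT_DTRANE,   "DTRANE"   },
	{ INT_BUFRE,    "BUFRE"    },
};

/*
 * Write the space-separated names of all known bits set in 'state' into
 * 'buf' (which must hold at least one byte) and return the bits that
 * this table does not know about.
 */
static uint32_t decode_irq_status(uint32_t state, char *buf, size_t len)
{
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(irq_bits) / sizeof(irq_bits[0]); i++) {
		if (!(state & irq_bits[i].mask))
			continue;
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, irq_bits[i].name, len - strlen(buf) - 1);
		state &= ~irq_bits[i].mask;
	}
	return state;
}
```

Under these assumptions, decode_irq_status(0x7400000, buf, sizeof(buf)) 
leaves "CMD12DRE CMD12RBE CMD12CRE BUFRE" in buf and returns 0, and both 
0x3000000 and 0x3800000 also decode without leftover bits, which matches 
the three types listed above.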

> After that, I tried the same thing without DMA by commenting out the
> slave IDs like this. In this case the spurious IRQs never occurred.
> 
> diff --git a/arch/arm/mach-shmobile/board-kzm9g.c b/arch/arm/mach-shmobile/board
> index 765f60a..d5e6609 100644
> --- a/arch/arm/mach-shmobile/board-kzm9g.c
> +++ b/arch/arm/mach-shmobile/board-kzm9g.c
> @@ -389,8 +389,8 @@ static struct resource sh_mmcif_resources[] = {
>  static struct sh_mmcif_plat_data sh_mmcif_platdata = {
>         .ocr            = MMC_VDD_165_195,
>         .caps           = MMC_CAP_8_BIT_DATA | MMC_CAP_NONREMOVABLE,
> -       .slave_id_tx    = SHDMA_SLAVE_MMCIF_TX,
> -       .slave_id_rx    = SHDMA_SLAVE_MMCIF_RX,
> +       /*.slave_id_tx  = SHDMA_SLAVE_MMCIF_TX,*/
> +       /*.slave_id_rx  = SHDMA_SLAVE_MMCIF_RX,*/
>  };
> 
> 

Interesting. So far I don't see how this is related, but it's good to 
know.

Thanks
Guennadi
---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/