Re: renesas_sdhi problems in 5.10-stable was Re: [PATCH 5.10 000/226] 5.10.198-rc1 review

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 25, 2023 at 11:26 PM Geert Uytterhoeven
<geert@xxxxxxxxxxxxxx> wrote:
> On Wed, Oct 25, 2023 at 9:53 PM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
> > On Wed, Oct 25, 2023 at 8:39 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> > > On 10/25/23 10:05, Geert Uytterhoeven wrote:
> > > > On Wed, Oct 25, 2023 at 2:35 PM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
> > > >> On Wed, Oct 25, 2023 at 12:53 PM Geert Uytterhoeven
> > > >> <geert@xxxxxxxxxxxxxx> wrote:
> > > >>> On Wed, Oct 25, 2023 at 12:47 PM Geert Uytterhoeven
> > > >>> <geert@xxxxxxxxxxxxxx> wrote:
> > > >>>> On Tue, Oct 24, 2023 at 9:22 PM Pavel Machek <pavel@xxxxxxx> wrote:
> > > >>>>> But we still have failures on Renesas with 5.10.199-rc2:
> > > >>>>>
> > > >>>>> https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/pipelines/1047368849
> > > >>>>>
> > > >>>>> And they still happed during MMC init:
> > > >>>>>
> > > >>>>>      2.638013] renesas_sdhi_internal_dmac ee100000.mmc: Got CD GPIO
> > > >>>>> [    2.638846] INFO: trying to register non-static key.
> > > >>>>> [    2.644192] ledtrig-cpu: registered to indicate activity on CPUs
> > > >>>>> [    2.649066] The code is fine but needs lockdep annotation, or maybe
> > > >>>>> [    2.649069] you didn't initialize this object before use?
> > > >>>>> [    2.649071] turning off the locking correctness validator.
> > > >>>>> [    2.649080] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.199-rc2-arm64-renesas-ge31b6513c43d #1
> > > >>>>> [    2.649082] Hardware name: HopeRun HiHope RZ/G2M with sub board (DT)
> > > >>>>> [    2.649086] Call trace:
> > > >>>>> [    2.655106] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
> > > >>>>> [    2.661354]  dump_backtrace+0x0/0x194
> > > >>>>> [    2.661361]  show_stack+0x14/0x20
> > > >>>>> [    2.667430] usbcore: registered new interface driver usbhid
> > > >>>>> [    2.672230]  dump_stack+0xe8/0x130
> > > >>>>> [    2.672238]  register_lock_class+0x480/0x514
> > > >>>>> [    2.672244]  __lock_acquire+0x74/0x20ec
> > > >>>>> [    2.681113] usbhid: USB HID core driver
> > > >>>>> [    2.687450]  lock_acquire+0x218/0x350
> > > >>>>> [    2.687456]  _raw_spin_lock+0x58/0x80
> > > >>>>> [    2.687464]  tmio_mmc_irq+0x410/0x9ac
> > > >>>>> [    2.688556] renesas_sdhi_internal_dmac ee160000.mmc: mmc0 base at 0x00000000ee160000, max clock rate 200 MHz
> > > >>>>> [    2.744936]  __handle_irq_event_percpu+0xbc/0x340
> > > >>>>> [    2.749635]  handle_irq_event+0x60/0x100
> > > >>>>> [    2.753553]  handle_fasteoi_irq+0xa0/0x1ec
> > > >>>>> [    2.757644]  __handle_domain_irq+0x7c/0xdc
> > > >>>>> [    2.761736]  efi_header_end+0x4c/0xd0
> > > >>>>> [    2.765393]  el1_irq+0xcc/0x180
> > > >>>>> [    2.768530]  arch_cpu_idle+0x14/0x2c
> > > >>>>> [    2.772100]  default_idle_call+0x58/0xe4
> > > >>>>> [    2.776019]  do_idle+0x244/0x2c0
> > > >>>>> [    2.779242]  cpu_startup_entry+0x20/0x6c
> > > >>>>> [    2.783160]  rest_init+0x164/0x28c
> > > >>>>> [    2.786561]  arch_call_rest_init+0xc/0x14
> > > >>>>> [    2.790565]  start_kernel+0x4c4/0x4f8
> > > >>>>> [    2.794233] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000014
> > > >>>>> [    2.803011] Mem abort info:
> > > >>>>>
> > > >>>>> from https://lava.ciplatform.org/scheduler/job/1025535
> > > >>>>> from
> > > >>>>> https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/jobs/5360973735 .
> > > >>>>>
> > > >>>>> Is there something else missing?
> > > >>
> > > >> It seems to be an intermittent issue. Investigating...
> > > >
> > > > After spending too much time on bisecting, the bad guy turns out to
> > > > be commit 6d3745bbc3341d3b ("mmc: renesas_sdhi: register irqs before
> > > > registering controller") in v5.10.198.
> > > >
> > > > Adding debug information shows the lock is mmc_host.lock.
> > > >
> > > > It is definitely initialized:
> > > >
> > > >      renesas_sdhi_probe()
> > > >      {
> > > >          ...
> > > >          tmio_mmc_host_alloc()
> > > >              mmc_alloc_host
> > > >                  spin_lock_init(&host->lock);
>
> Initializing mmc_host.lock.
>
> > > >          ...
> > > >          devm_request_irq()
> > > >          -> tmio_mmc_irq
> > > >              tmio_mmc_cmd_irq()
> > > >                  spin_lock(&host->lock);
>
> Locking tmio_mmc_host.lock, but ...
>
> > > >          ...
> > > >      }
> > > >
> > > > That leaves us with a missing lockdep annotation?
> > >
> > > Is it possible that the lock initialization is overwritten ?
> > > I seem to recall a recent case where this happens.
> > >
> > > Also, there is
> > >         spin_lock_init(&_host->lock);
> > > in tmio_mmc_host_probe(), and tmio_mmc_host_probe() is called after
> > > devm_request_irq().
> >
> > Unless I am missing something, that is initializing tmio_mmc_host.lock,
> > which is a different lock than mmc_host.lock?
>
> ... tmio_mmc_host.lock is initialized only here.
>
> Now the question remains why this is not triggered in mainline.
> More investigation to do tomorrow...

| --- a/drivers/mmc/host/renesas_sdhi_core.c
| +++ b/drivers/mmc/host/renesas_sdhi_core.c
| @@ -1011,6 +1011,8 @@ int renesas_sdhi_probe(struct platform_device *pdev,
|                         renesas_sdhi_start_signal_voltage_switch;
|                 host->sdcard_irq_setbit_mask = TMIO_STAT_ALWAYS_SET_27;
|                 host->reset = renesas_sdhi_reset;

host->sdcard_irq_mask_all is not initialized in this branch

| +       } else {
| +               host->sdcard_irq_mask_all = TMIO_MASK_ALL;
|         }

|         /* Orginally registers were 16 bit apart, could be 32 or 64
nowadays */
| @@ -1098,9 +1100,7 @@ int renesas_sdhi_probe(struct platform_device *pdev,
|                 host->ops.hs400_complete = renesas_sdhi_hs400_complete;
|         }

| -       ret = tmio_mmc_host_probe(host);
| -       if (ret < 0)
| -               goto edisclk;
| +       sd_ctrl_write32_as_16_and_16(host, CTL_IRQ_MASK,
host->sdcard_irq_mask_all);

Fails to disable interrupts for real as host->sdcard_irq_mask_all is
still zero.

|         num_irqs = platform_irq_count(pdev);
|         if (num_irqs < 0) {
| @@ -1127,6 +1127,10 @@ int renesas_sdhi_probe(struct platform_device *pdev,
|                         goto eirq;
|         }

| +       ret = tmio_mmc_host_probe(host);

Initializes host->sdcard_irq_mask_all when needed and disables
interrupts:

        if (!_host->sdcard_irq_mask_all)
                _host->sdcard_irq_mask_all = TMIO_MASK_ALL;
        tmio_mmc_disable_mmc_irqs(_host, _host->sdcard_irq_mask_all);

If the interrupt came in before, we have an issue.

| +       if (ret < 0)
| +               goto edisclk;
| +
|         dev_info(&pdev->dev, "%s base at %pa, max clock rate %u MHz\n",
|                  mmc_hostname(host->mmc), &res->start,
host->mmc->f_max / 1000000);

The solution is to backport commit 9f12cac1bb88e329 ("mmc: renesas_sdhi:
use custom mask for TMIO_MASK_ALL") in v5.13.
As this doesn't backport cleanly, I'll submit a (tested) patch.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds



[Index of Archives]     [Linux Samsung SOC]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Samba]     [Device Mapper]

  Powered by Linux