Re: [PATCH] arm64: dts: rockchip: Fix rk3399-gru-* s2r (pinctrl hogs, wifi reset)

Heiko Stübner <heiko@xxxxxxxxx> · Mon, 12 Mar 2018 14:31:43 +0100

Am Mittwoch, 7. März 2018, 05:54:32 CET schrieb Doug Anderson:
> Hi,
> 
> On Tue, Mar 6, 2018 at 10:58 AM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
> > On 06/03/18 18:49, Heiko Stübner wrote:
> >> Am Dienstag, 6. März 2018, 19:15:18 CET schrieb Marc Zyngier:
> >>> Hi Doug,
> >>> 
> >>> On 06/03/18 18:00, Doug Anderson wrote:
> >>>> Hi,
> >>>> 
> >>>> On Tue, Mar 6, 2018 at 3:58 AM, Marc Zyngier <marc.zyngier@xxxxxxx> 
wrote:
> >>>>> Hi all,
> >>>>> 
> >>>>> On 01/03/18 08:43, Heiko Stübner wrote:
> >>>>>> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
> >>>>>>> Back in the early days when gru devices were still under development
> >>>>>>> we found an issue where the WiFi reset line needed to be configured
> >>>>>>> as
> >>>>>>> early as possible during the boot process to avoid the WiFi module
> >>>>>>> being in a bad state.
> >>>>>>> 
> >>>>>>> We found that the way to get the kernel to do this in the earliest
> >>>>>>> possible place was to configure this line in the pinctrl hogs, so
> >>>>>>> that's what we did.  For some history here you can see
> >>>>>>> <http://crosreview.com/368770>.  After the time that change landed
> >>>>>>> in
> >>>>>>> the kernel, we landed a firmware change to configure this line even
> >>>>>>> earlier.  See <http://crosreview.com/399919>.  However, even after
> >>>>>>> the
> >>>>>>> firmware change landed we kept the kernel change to deal with the
> >>>>>>> fact
> >>>>>>> that some people working on devices might take a little while to
> >>>>>>> update their firmware.
> >>>>>>> 
> >>>>>>> At this there are definitely zero devices out in the wild that have
> >>>>>>> firmware without the fix in it.  Specifically looking in the
> >>>>>>> firmware
> >>>>>>> branch several critically important fixes for memory stability
> >>>>>>> landed
> >>>>>>> after the patch in coreboot and I know we didn't ship without those.
> >>>>>>> Thus, by now, everyone should have the new firmware and it's safe to
> >>>>>>> not have the kernel set this up in a pinctrl hog.
> >>>>>>> 
> >>>>>>> Historically, even though it wasn't needed to have this in a pinctrl
> >>>>>>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
> >>>>>>> default hog at bootup and then would never touch things again.  That
> >>>>>>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
> >>>>>>> during suspend/resume").  After that commit then we'll re-apply the
> >>>>>>> default hog at resume time and that can screw up the reset state of
> >>>>>>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong
> >>>>>>> way
> >>>>>>> then the whole system can go haywire.  That's what was happening.
> >>>>>>> Specifically you'd resume a rk3399-gru-* device and it would mostly
> >>>>>>> resume, then would crash with some crazy weird crash.
> >>>>>>> 
> >>>>>>> One could say, perhaps, that the recent pinctrl change was at fault
> >>>>>>> (and should be fixed) since it changed behavior.  ...but that's not
> >>>>>>> really true.  The device tree for rk3399-gru is really to blame.
> >>>>>>> Specifically since the pinctrl is defined in the hog and not in the
> >>>>>>> "wlan-pd-n" node then the actual user of this pin doesn't have a
> >>>>>>> pinctrl entry for it.  That's bad.
> >>>>>>> 
> >>>>>>> Let's fix our problems by just moving the control of
> >>>>>>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
> >>>>>>> proper place.
> >>>>>>> 
> >>>>>>> NOTE: in theory, I think it should actually be possible to have a
> >>>>>>> pin
> >>>>>>> controlled _both_ by the hog and by an actual device.  Once the
> >>>>>>> device
> >>>>>>> claims the pin I think the hog is supposed to let go.  I'm not 100%
> >>>>>>> sure that this works and in any case this solution would be more
> >>>>>>> complex than is necessary.
> >>>>>>> 
> >>>>>>> Reported-by: Marc Zyngier <marc.zyngier@xxxxxxx>
> >>>>>>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
> >>>>>>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during
> >>>>>>> suspend/resume")
> >>>>>>> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> >>>>>> 
> >>>>>> applied as fix for 4.16 with the 2 Tested-tags
> >>>>> 
> >>>>> Sorry to rain on everyone's parade, but further testing shows that
> >>>>> this
> >>>>> patch may not be enough to restore a reliable s2r. My initial testing
> >>>>> did show that we were resuming without the VOP errors, but there seem
> >>>>> to
> >>>>> be further issues (I'm loosing the keyboard and the trackpad after
> >>>>> resume on Kevin).
> >>>>> 
> >>>>> Applying my initial hack makes it work again. I suspect that there are
> >>>>> more hog pins that need tweaking, but I'm a bit out of my depth here.
> >>>> 
> >>>> Are you positive you weren't just wearing your lucky hat when you
> >>>> tested your patch and then took it off when you tested mine?  As far
> >>> 
> >>> So far, I seem to have a 100% success rate in resuming with my silly
> >>> hack, whist your DT patch alone only gives me a 50% rate at best.
> >>> 
> >>>> as I can see the only hogs left on kevin are:
> >>>>   &ap_pwroff      /* AP will auto-assert this when in S3 */
> >>>>   &clk_32k        /* This pin is always 32k on gru boards */
> >>>> 
> >>>> Those map to:
> >>>>   ap_pwroff: ap-pwroff {
> >>>>   
> >>>>      rockchip,pins = <1 5 RK_FUNC_1 &pcfg_pull_none>;
> >>>>   
> >>>>   };
> >>>>   
> >>>>   clk_32k: clk-32k {
> >>>>   
> >>>>     rockchip,pins = <0 0 RK_FUNC_2 &pcfg_pull_none>;
> >>>>   
> >>>>   };
> >>>> 
> >>>> So I added some printouts at suspend/resume time.  Specifically I set
> >>>> a boolean to "true" for the duration rockchip_pinctrl_suspend() and
> >>>> rockchip_pinctrl_resume() and this turned on a printout in
> >>>> rockchip_set_mux().  My printout looked like this (yeah, I know it's a
> >>>> whitespace-damaged patch just to show what I'm doing):
> >>>> 
> >>>> +       regmap_read(regmap, reg, &before);
> >>>> 
> >>>>         data = (mask << (bit + 16));
> >>>>         rmask = data | (data >> 16);
> >>>>         data |= (mux & mask) << bit;
> >>>>         ret = regmap_update_bits(regmap, reg, rmask, data);
> >>>> 
> >>>> +       regmap_read(regmap, reg, &after);
> >>>> +
> >>>> +       if (DOUG) {
> >>>> +               dev_info(info->dev,
> >>>> +                        "setting mux of GPIO%d-%d to %d;
> >>>> %#010x=>%#010x\n", +                        bank->bank_num, pin, mux,
> >>>> reg, before, after); +       }
> >>>> 
> >>>> ...and a similar one in rockchip_set_pull().  That showed this at
> >>>> resume
> >>>> time:
> >>>> 
> >>>> [   62.284427] rockchip-pinctrl pinctrl: setting mux of GPIO1-5 to 1;
> >>>> 0x00009400=>0x00009400
> >>>> [   62.294423] rockchip-pinctrl pinctrl: setting pull of GPIO1-5;
> >>>> 0x000041aa=>0x000041aa
> >>>> [   62.303343] rockchip-pinctrl pinctrl: setting mux of GPIO0-0 to 2;
> >>>> 0x00005002=>0x00005002
> >>>> [   62.313240] rockchip-pinctrl pinctrl: setting pull of GPIO0-0;
> >>>> 0x00000ddc=>0x00000ddc
> >>>> [
> >>>> 
> >>>> Said another way: pinmux and pull isn't actually changing due to the
> >>>> hogs.  We can see if something else could be changing, but I'd really
> >>>> want to be sure you're certain that the hogs are causing you
> >>>> problems...
> >>> 
> >>> I cannot say for sure that the hogs are the issue. But I thought that
> >>> they were the only pins affected by 981ed1bfbc6c... If this patch can
> >>> affect other pins, then I'm probably barking up the wrong tree.
> >> 
> >> On Kevin I see something like
> >> 
> >> [   60.764129] cros-ec-spi spi2.0: spi transfer failed: -108
> >> [   60.764132] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
> >> [   60.764136] cros-ec-spi spi2.0: Command xfer error (err:-108)
> >> [   60.764365] cros-ec-spi spi2.0: spi transfer failed: -108
> >> [   60.764368] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
> >> [   60.764371] cros-ec-spi spi2.0: Command xfer error (err:-108)
> >> 
> >> on resume with my current for-next. So maybe your hack just
> >> happened to change some timing during resume?
> > 
> > No, I carry yet another patch to make that one work[1].
> > 
> >> Suspend/Resume also disconnects my usb-ethernet, making me lose my
> >> nfsroot, so I can test this once every boot only.
> 
> Yeah, it kills usb-ethernet for me too.  ...so I can suspend/resume
> once and then the next suspend fails with a bunch of usb errors.  This
> is based on your "hack/kevin-4.16", which looks like:
> 
> e7934b797f4b (HEAD, linux_arm-platforms/hack/kevin-4.16) arm64: dts:
> rockchip: Fix rk3399-gru-* s2r (pinctrl hogs, wifi reset)
> 9b9988414f44 irqchip/gic-v3: Allow LPIs to be disabled from the command line
> d5e06c686858 iommu/rockchip: Perform a reset on shutdown
> 55b36a99626f drm/rockchip: Don't use spin_lock_irqsave in interrupt context
> 7d72d7e57c2c drm/rockchip: Do not use memcpy for MMIO addresses
> 236afcd0425c drm/rockchip: Clear all interrupts before requesting the IRQ
> 31608ae0d3fc arm64: Enable dynamic sched_domain flag setting
> 1b255643cdc3 drivers/base/arch_topology: Dynamic sched_domain flag detection
> 2caca1b31f89 arm64: rk3399: Add capacity-dmips-mhz attributes
> 36ced612e4d3 mfd: cros_ec: add RTC as mfd subdevice
> 4e95dc697ec6 spi: rockchip: Convert to late and early system PM callbacks
> d27d6da92086 drm/rockchip: analogix_dp: Ensure that the bridge is
> powered before poking it
> bcce86412ec1 arm64: DT: rk3399: Add missing EDP clock
> c04436bd57b8 bootloader cmdline
> bb8a4d168d58 build hacks
> 26e04c84de6c kevin: build stuff
> 6915015e30db kevin: defconfig
> 4a3928c6f8a5 (tag: v4.16-rc3) Linux 4.16-rc3
> 
> 
> Actually, I also see some errors reading thermal channels, but I
> wonder if perhaps USB is causing some sort of interrupt storm and that
> happens to randomly take out whatever device was trying to talk at the
> same time?
> 
> [   83.718999] read channel() error: -110
> [   83.723275] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   83.822810] read channel() error: -110
> [   83.827048] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   84.862810] read channel() error: -110
> [   84.867051] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   84.990800] read channel() error: -110
> [   84.995034] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   86.110817] read channel() error: -110
> [   86.115259] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   86.214990] read channel() error: -110
> [   86.219440] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   87.295049] read channel() error: -110
> [   87.299316] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   87.398805] read channel() error: -110
> [   87.403040] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   88.510808] read channel() error: -110
> [   88.515053] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   88.646810] read channel() error: -110
> [   88.651250] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   89.874606] xhci-hcd xhci-hcd.2.auto: Abort failed to stop command ring:
> -110 [   89.896134] xhci-hcd xhci-hcd.2.auto: xHCI host controller not
> responding, assume dead
> [   89.905130] xhci-hcd xhci-hcd.2.auto: HC died; cleaning up
> [   89.911491] xhci-hcd xhci-hcd.2.auto: Timeout while waiting for
> configure endpoint command
> 
> > I use my kevin as a "real" laptop, which means it gets suspended/resumed
> > at least 20 times a day, no reboots involved (unless I'm actually testing
> > arm64 code on it).
> 
> Enric: I know you've been working with Kevin stuff a lot.  Any chance
> you reproduce Marc's failures and also see that it's fixed with his
> hack patch?
> 
> Marc: have you posted actual logs for the failing case (after picking
> my dts fix) somewhere?

just to interject, I'd guess that this is more an issue on Marc's branch?
I.e. I was testing the big hunk of analogix patches just now and did try
suspending as well and apart from the obvious fixes for spi / clk
the system suspends + resumes and I also get the keyboard back
in working condition after resume. Of course my nfsroot still dies
so I cannot test multiple suspend cycles, but at least once per boot
it works.

Test-branch in question is at
https://github.com/mmind/linux-rockchip/tree/tmp/testing_20180312

Base is my devel/upstream, which is master + my for-next + drmmisc
+ stuff that already went into some maintainer tree and looked necessary.

Heiko
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html