On Wed, Dec 18, 2019 at 01:11:14PM +0000, Mark Brown wrote:
> On Wed, Dec 18, 2019 at 01:21:57PM +0100, Greg KH wrote:
> > On Wed, Dec 18, 2019 at 11:34:58AM +0000, Mark Brown wrote:
> > > On Tue, Dec 17, 2019 at 11:51:55PM +0800, Siddharth Kapoor wrote:

> > > > I would like to share a concern with the regulator patch which is part
> > > > of the 4.9.196 LTS kernel.

> > > That's an *extremely* old kernel.

> > It is, but it's the latest stable kernel (well, close to), and your patch
> > was tagged by you to be backported to here, so if there's a problem with
> > a stable branch, I want to know about it as I don't want to see
> > regressions happen in it.

> I don't track what's in older stable kernels, it wanted to go back at
> least one kernel revision but the issue has been around since forever.

Ok, you can always mark patches that way if you want to :)

> > > I've got nothing to do with the stable kernels so there's nothing I can
> > > do here, sorry.

> > Should I revert it everywhere?  This patch reads as if it should be
> > fixing problems, not causing them :)

> The main targets were whatever Debian and Ubuntu are shipping (and to a
> lesser extent SuSE or RHEL, but they don't use stable directly); it's
> less relevant to anything that only gets used on embedded stuff.  It's
> right on the knife edge of what I'd backport, but then that's way less
> enthusiastic than stable is in general these days.

I've reverted it now from 4.14.y and 4.9.y.

> > > Possibly your GPU supplies need to be flagged as always on, possibly
> > > your GPU driver is forgetting to enable some supplies it needs, or
> > > possibly there's a missing always-on constraint on one of the regulators,
> > > depending on how the driver expects this to work (if it's a proprietary
> > > driver it shouldn't be using the regulator API itself).  I'm quite
> > > surprised you've not seen any issue before given that the supplies
> > > would still have been disabled, just earlier.

> > Timing "luck" is probably something we shouldn't be messing with in
> > stable kernels.  How about I revert this for the 4.14 and older releases
> > and let new devices deal with the timing issues when they are brought up
> > on new hardware?

> To be clear, this is more a straight up bug in their stuff than the sort
> of thing you'd normally think of as a race condition; we're talking
> about moving the timing by 30 seconds here.  The case that we saw
> already was just a clear and obvious bug that was made more visible (the
> driver was using the wrong name for a supply, so lookups were always
> failing, but some sequence of events meant it didn't produce big runtime
> failures).
>
> If you don't want to be messing with timing luck then you probably want
> to be having a look at what Sasha's bot is doing, it's picking up a lot
> of things that are *well* into this sort of territory (and into the bad
> interactions with out-of-tree code territory).  I personally would not
> be using stable these days if I wasn't prepared to be digging into
> something like this.

I watch what his bot is doing, and we have tons of testing happening as
well, which is reflected by the fact that THIS WAS CAUGHT HERE.

This is a sign that things are working; it's just that some SoC trees are
slower than mainline by a few months, and that's fine.  It's worlds better
than the SoC trees that are nowhere close to mainline, and as such,
totally insecure :)

thanks,

greg k-h
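
As a footnote to Mark's point above about the driver using the wrong name
for a supply: a minimal sketch of the consumer-side regulator lookup he is
describing.  The name passed to the regulator API has to match the
"<name>-supply" property on the device's DT node, and a mismatch means
every lookup fails and the driver never enables the supply itself.  The
"vdd-gpu" supply name and the probe function below are hypothetical, not
taken from the driver in question.

	#include <linux/err.h>
	#include <linux/platform_device.h>
	#include <linux/regulator/consumer.h>

	static int example_gpu_probe(struct platform_device *pdev)
	{
		struct regulator *vdd;
		int ret;

		/*
		 * "vdd-gpu" must match a "vdd-gpu-supply" property on this
		 * device's DT node (or an equivalent board-file lookup); a
		 * wrong name here means the get fails and the supply stays
		 * whatever state the core or bootloader left it in.
		 */
		vdd = devm_regulator_get(&pdev->dev, "vdd-gpu");
		if (IS_ERR(vdd))
			return PTR_ERR(vdd);

		/* Keep the supply enabled for as long as the GPU needs it. */
		ret = regulator_enable(vdd);
		if (ret)
			return ret;

		return 0;
	}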