On Fri, 2023-08-18 at 13:11 -0500, Mario Limonciello wrote:
> On 8/18/2023 13:07, Jarkko Sakkinen wrote:
> > On Fri Aug 18, 2023 at 8:57 PM EEST, Mario Limonciello wrote:
> > > On 8/18/2023 12:53, Jarkko Sakkinen wrote:
> > > > On Fri Aug 18, 2023 at 8:21 PM EEST, Mario Limonciello wrote:
> > > > > On 8/18/2023 12:00, Jarkko Sakkinen wrote:
> > > > > > On Fri Aug 18, 2023 at 4:58 AM EEST, Limonciello, Mario wrote:
> > > > > > > On 8/17/2023 5:33 PM, Jarkko Sakkinen wrote:
> > > > > > > > On Fri Aug 18, 2023 at 1:25 AM EEST, Todd Brandt wrote:
> > > > > > > > > On Fri, 2023-08-18 at 00:47 +0300, Jarkko Sakkinen wrote:
> > > > > > > > > > On Fri Aug 18, 2023 at 12:09 AM EEST, Todd Brandt wrote:
> > > > > > > > > > > While testing S3 on 6.5.0-rc6 we've found that 5 systems
> > > > > > > > > > > are seeing a crash and reboot situation when S3 suspend
> > > > > > > > > > > is initiated. To reproduce it, this call is all that's
> > > > > > > > > > > required "sudo sleepgraph -m mem -rtcwake 15".
> > > > > > > > > >
> > > > > > > > > > 1. Are there logs available?
> > > > > > > > > > 2. Is this the test case: https://pypi.org/project/sleepgraph/
> > > > > > > > > >    (never used it before).
> > > > > > > > >
> > > > > > > > > There are no dmesg logs because the S3 crash wipes them out.
> > > > > > > > > Sleepgraph isn't actually necessary to activate it, just an
> > > > > > > > > S3 suspend "echo mem > /sys/power/state".
> > > > > > > > >
> > > > > > > > > So far it appears to only have affected test systems, not
> > > > > > > > > production hardware, and none of them have TPM chips, so I'm
> > > > > > > > > beginning to wonder if this patch just inadvertently
> > > > > > > > > activated a bug somewhere else in the kernel that happens to
> > > > > > > > > affect test hardware.
> > > > > > > > >
> > > > > > > > > I'll continue to debug it, this isn't an emergency as so far
> > > > > > > > > I haven't seen it in production hardware.
> > > > > > > >
> > > > > > > > OK, I'll still see if I could reproduce it just in case.
> > > > > > > >
> > > > > > > > BR, Jarkko
> > > > > > >
> > > > > > > I'd like to better understand what kind of TPM initialization
> > > > > > > path has run. Does the machine have some sort of TPM that failed
> > > > > > > to fully initialize perhaps?
> > > > > > >
> > > > > > > If you can't share a full bootup dmesg, can you at least share
> > > > > > >
> > > > > > > # dmesg | grep -i tpm
> > > > > >
> > > > > > It would be more useful perhaps to get full dmesg output after
> > > > > > power on and before going into suspend.
> > > > > >
> > > > > > Also ftrace filter could be added to the kernel command-line:
> > > > > >
> > > > > > ftrace=function ftrace_filter=tpm*
> > > > > >
> > > > > > After bootup:
> > > > > >
> > > > > > mount -t tracefs nodev /sys/kernel/tracing
> > > > > > cat /sys/kernel/tracing/trace
> > > > > >
> > > > > > BR, Jarkko
> > > > >
> > > > > Todd and I have gone back and forth a little bit on the bugzilla
> > > > > (https://bugzilla.kernel.org/show_bug.cgi?id=217804), and it seems
> > > > > that this isn't an S3 problem - it's a probing problem.
> > > > >
> > > > > [ 1.132521] tpm_crb: probe of INTC6001:00 failed with error 378
> > > > >
> > > > > That error 378 specifically matches TPM2_CC_GET_CAPABILITY, which is
> > > > > the same command that was being requested. This leads me to believe
> > > > > the TPM isn't ready at the time of probing.
> > > > >
> > > > > In this case one solution is we could potentially ignore failures
> > > > > for that tpm2_get_tpm_pt() call, but I think we should first
> > > > > understand why it doesn't work at probing time for this TPM to
> > > > > ensure the actual quirk isn't built on a house of cards.
> > > >
> > > > Given that there is nothing known broken (at the moment) in
> > > > production, I think the following might be a reasonable change.
> > > >
> > > > BR, Jarkko
> > >
> > > Yeah that would prevent it.
> > >
> > > Here's a simpler change that I think should work too though:
> > > diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
> > > index 9eb1a18590123..b0e9931fe436c 100644
> > > --- a/drivers/char/tpm/tpm_crb.c
> > > +++ b/drivers/char/tpm/tpm_crb.c
> > > @@ -472,8 +472,7 @@ static int crb_check_flags(struct tpm_chip *chip)
> > >          if (ret)
> > >                  return ret;
> > >
> > > -        ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
> > > -        if (ret)
> > > +        if (tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL))
> > >                  goto release;
> > >
> > >          if (val == 0x414D4400U /* AMD */)
> > >
> > > I think Todd needs to check whether TPM works with that or not though.
> >
> > Hmm... I'm sorry if I have a blind spot now but what is that changing?
> >
> > BR, Jarkko
>
> It throws away the error code if it fails for some reason.
> Todd just checked it works too. I'll drop it on the M/L for review.

I just ran 6.5.0-rc6 plus this patch on all 5 machines where the problem was
detected and they work now. It looks good.

Tested-by: Todd Brandt <todd.e.brandt@xxxxxxxxxxxxxxx>
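For reference, the reason 378 lines up with TPM2_CC_GET_CAPABILITY is simple
arithmetic: the TPM 2.0 GetCapability command code is 0x17a, which is 378 in
decimal. A minimal standalone sketch of that mapping (not kernel code; the
macro name just mirrors the kernel's TPM2_CC_GET_CAPABILITY constant):

	#include <stdio.h>

	/* GetCapability command code from the TPM 2.0 specification */
	#define TPM2_CC_GET_CAPABILITY 0x17a

	int main(void)
	{
		/* 0x17a == 378, matching "failed with error 378" in the probe message */
		printf("TPM2_CC_GET_CAPABILITY = %d\n", TPM2_CC_GET_CAPABILITY);
		return 0;
	}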