Re: REGRESSION WITH BISECT: v6.5-rc6 TPM patch breaks S3 on some Intel systems

"Jarkko Sakkinen" <jarkko@xxxxxxxxxx> · Fri, 18 Aug 2023 21:07:37 +0300

On Fri Aug 18, 2023 at 8:57 PM EEST, Mario Limonciello wrote:
> On 8/18/2023 12:53, Jarkko Sakkinen wrote:
> > On Fri Aug 18, 2023 at 8:21 PM EEST, Mario Limonciello wrote:
> >> On 8/18/2023 12:00, Jarkko Sakkinen wrote:
> >>> On Fri Aug 18, 2023 at 4:58 AM EEST, Limonciello, Mario wrote:
> >>>>
> >>>>
> >>>> On 8/17/2023 5:33 PM, Jarkko Sakkinen wrote:
> >>>>> On Fri Aug 18, 2023 at 1:25 AM EEST, Todd Brandt wrote:
> >>>>>> On Fri, 2023-08-18 at 00:47 +0300, Jarkko Sakkinen wrote:
> >>>>>>> On Fri Aug 18, 2023 at 12:09 AM EEST, Todd Brandt wrote:
> >>>>>>>> While testing S3 on 6.5.0-rc6 we've found that 5 systems are seeing
> >>>>>>>> a
> >>>>>>>> crash and reboot situation when S3 suspend is initiated. To
> >>>>>>>> reproduce
> >>>>>>>> it, this call is all that's required "sudo sleepgraph -m mem
> >>>>>>>> -rtcwake
> >>>>>>>> 15".
> >>>>>>>
> >>>>>>> 1. Are there logs available?
> >>>>>>> 2. Is this the test case: https://pypi.org/project/sleepgraph/ (never
> >>>>>>> used it before).
> >>>>>>
> >>>>>> There are no dmesg logs because the S3 crash wipes them out. Sleepgraph
> >>>>>> isn't actually necessary to activate it, just an S3 suspend "echo mem >
> >>>>>> /sys/power/state".
> >>>>>>
> >>>>>> So far it appears to only have affected test systems, not production
> >>>>>> hardware, and none of them have TPM chips, so I'm beginning to wonder
> >>>>>> if this patch just inadvertently activated a bug somewhere else in the
> >>>>>> kernel that happens to affect test hardware.
> >>>>>>
> >>>>>> I'll continue to debug it, this isn't an emergency as so far I haven't
> >>>>>> seen it in production hardware.
> >>>>>
> >>>>> OK, I'll still see if I could reproduce it just in case.
> >>>>>
> >>>>> BR, Jarkko
> >>>>
> >>>> I'd like to better understand what kind of TPM initialization path has
> >>>> run.  Does the machine have some sort of TPM that failed to fully
> >>>> initialize perhaps?
> >>>>
> >>>> If you can't share a full bootup dmesg, can you at least share
> >>>>
> >>>> # dmesg | grep -i tpm
> >>>
> >>> It would be more useful perhaps to get full dmesg output after power on
> >>> and before going into suspend.
> >>>
> >>> Also ftrace filter could be added to the kernel command-line:
> >>>
> >>> ftrace=function ftrace_filter=tpm*
> >>>
> >>> After bootup:
> >>>
> >>> mount -t tracefs nodev /sys/kernel/tracing
> >>> cat /sys/kernel/tracing/trace
> >>>
> >>> BR, Jarkko
> >>
> >> Todd and I have gone back and forth a little bit on the bugzilla
> >> (https://bugzilla.kernel.org/show_bug.cgi?id=217804), and it seems that
> >> this isn't an S3 problem - it's a probing problem.
> >>
> >> [    1.132521] tpm_crb: probe of INTC6001:00 failed with error 378
> >>
> >> That error 378 specifically matches TPM2_CC_GET_CAPABILITY, which is the
> >> same command that was being requested.  This leads me to believe the TPM
> >> isn't ready at the time of probing.
> >>
> >> In this case one solution is we could potentially ignore failures for
> >> that tpm2_get_tpm_pt() call, but I think we should first understand why
> >> it doesn't work at probing time for this TPM to ensure the actual quirk
> >> isn't built on a house of cards.
> > 
> > Given that there is nothing known broken (at the moment) in production,
> > I think the following might be a reasonable change.
> > 
> > BR, Jarkko
> > 
>
> Yeah that would prevent it.
>
> Here's a simpler change that I think should work too though:
> diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
> index 9eb1a18590123..b0e9931fe436c 100644
> --- a/drivers/char/tpm/tpm_crb.c
> +++ b/drivers/char/tpm/tpm_crb.c
> @@ -472,8 +472,7 @@ static int crb_check_flags(struct tpm_chip *chip)
>          if (ret)
>                  return ret;
>
> -       ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
> -       if (ret)
> +       if (tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL))
>                  goto release;
>
>          if (val == 0x414D4400U /* AMD */)
>
> I think Todd needs to check whether TPM works with that or not though.

Hmm... I'm sorry if I have a blind spot now but what is that changing?

BR, Jarkko