On Fri, 2023-08-18 at 13:11 -0500, Mario Limonciello wrote:
> On 8/18/2023 13:07, Jarkko Sakkinen wrote:
> > On Fri Aug 18, 2023 at 8:57 PM EEST, Mario Limonciello wrote:
> > > On 8/18/2023 12:53, Jarkko Sakkinen wrote:
> > > > On Fri Aug 18, 2023 at 8:21 PM EEST, Mario Limonciello wrote:
> > > > > On 8/18/2023 12:00, Jarkko Sakkinen wrote:
> > > > > > On Fri Aug 18, 2023 at 4:58 AM EEST, Limonciello, Mario wrote:
> > > > > > > On 8/17/2023 5:33 PM, Jarkko Sakkinen wrote:
> > > > > > > > On Fri Aug 18, 2023 at 1:25 AM EEST, Todd Brandt wrote:
> > > > > > > > > On Fri, 2023-08-18 at 00:47 +0300, Jarkko Sakkinen wrote:
> > > > > > > > > > On Fri Aug 18, 2023 at 12:09 AM EEST, Todd Brandt wrote:
> > > > > > > > > > > While testing S3 on 6.5.0-rc6 we've found that 5 systems
> > > > > > > > > > > are seeing a crash and reboot situation when S3 suspend
> > > > > > > > > > > is initiated. To reproduce it, this call is all that's
> > > > > > > > > > > required "sudo sleepgraph -m mem -rtcwake 15".
> > > > > > > > > >
> > > > > > > > > > 1. Are there logs available?
> > > > > > > > > > 2. Is this the test case: https://pypi.org/project/sleepgraph/
> > > > > > > > > >    (never used it before).
> > > > > > > > >
> > > > > > > > > There are no dmesg logs because the S3 crash wipes them out.
> > > > > > > > > Sleepgraph isn't actually necessary to activate it, just an
> > > > > > > > > S3 suspend "echo mem > /sys/power/state".
> > > > > > > > >
> > > > > > > > > So far it appears to only have affected test systems, not
> > > > > > > > > production hardware, and none of them have TPM chips, so I'm
> > > > > > > > > beginning to wonder if this patch just inadvertently
> > > > > > > > > activated a bug somewhere else in the kernel that happens to
> > > > > > > > > affect test hardware.
> > > > > > > > >
> > > > > > > > > I'll continue to debug it, this isn't an emergency as so far
> > > > > > > > > I haven't seen it in production hardware.
> > > > > > > >
> > > > > > > > OK, I'll still see if I could reproduce it just in case.
> > > > > > > >
> > > > > > > > BR, Jarkko
> > > > > > >
> > > > > > > I'd like to better understand what kind of TPM initialization
> > > > > > > path has run. Does the machine have some sort of TPM that failed
> > > > > > > to fully initialize perhaps?
> > > > > > >
> > > > > > > If you can't share a full bootup dmesg, can you at least share
> > > > > > >
> > > > > > > # dmesg | grep -i tpm
> > > > > >
> > > > > > It would be more useful perhaps to get full dmesg output after
> > > > > > power on and before going into suspend.
> > > > > >
> > > > > > Also ftrace filter could be added to the kernel command-line:
> > > > > >
> > > > > > ftrace=function ftrace_filter=tpm*
> > > > > >
> > > > > > After bootup:
> > > > > >
> > > > > > mount -t tracefs nodev /sys/kernel/tracing
> > > > > > cat /sys/kernel/tracing/trace
> > > > > >
> > > > > > BR, Jarkko
> > > > >
> > > > > Todd and I have gone back and forth a little bit on the bugzilla
> > > > > (https://bugzilla.kernel.org/show_bug.cgi?id=217804), and it seems
> > > > > that this isn't an S3 problem - it's a probing problem.
> > > > >
> > > > > [ 1.132521] tpm_crb: probe of INTC6001:00 failed with error 378
> > > > >
> > > > > That error 378 specifically matches TPM2_CC_GET_CAPABILITY, which is
> > > > > the same command that was being requested. This leads me to believe
> > > > > the TPM isn't ready at the time of probing.
> > > > >
> > > > > In this case one solution is we could potentially ignore failures
> > > > > for that tpm2_get_tpm_pt() call, but I think we should first
> > > > > understand why it doesn't work at probing time for this TPM to
> > > > > ensure the actual quirk isn't built on a house of cards.
> > > >
> > > > Given that there is nothing known broken (at the moment) in
> > > > production, I think the following might be a reasonable change.
> > > >
> > > > BR, Jarkko
> > >
> > > Yeah that would prevent it.
> > >
> > > Here's a simpler change that I think should work too though:
> > > diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
> > > index 9eb1a18590123..b0e9931fe436c 100644
> > > --- a/drivers/char/tpm/tpm_crb.c
> > > +++ b/drivers/char/tpm/tpm_crb.c
> > > @@ -472,8 +472,7 @@ static int crb_check_flags(struct tpm_chip *chip)
> > >          if (ret)
> > >                  return ret;
> > >
> > > -        ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
> > > -        if (ret)
> > > +        if (tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL))
> > >                  goto release;
> > >
> > >          if (val == 0x414D4400U /* AMD */)
> > >
> > > I think Todd needs to check whether TPM works with that or not though.
> >
> > Hmm... I'm sorry if I have a blind spot now but what is that changing?
> >
> > BR, Jarkko
>
> It throws away the error code if it fails for some reason.
> Todd just checked it works too. I'll drop it on the M/L for review.

I just ran 6.5.0-rc6 plus this patch on all 5 machines where the problem was
detected and they work now. It looks good.

Tested-by: Todd Brandt <todd.e.brandt@xxxxxxxxxxxxxxx>
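For reference, the reason 378 lines up with TPM2_CC_GET_CAPABILITY is simple
arithmetic: the TPM 2.0 GetCapability command code is 0x17a, which is 378 in
decimal. A minimal standalone sketch of that mapping (not kernel code; the
macro name just mirrors the kernel's TPM2_CC_GET_CAPABILITY constant):

	#include <stdio.h>

	/* GetCapability command code from the TPM 2.0 specification */
	#define TPM2_CC_GET_CAPABILITY 0x17a

	int main(void)
	{
		/* 0x17a == 378, matching "failed with error 378" in the probe message */
		printf("TPM2_CC_GET_CAPABILITY = %d\n", TPM2_CC_GET_CAPABILITY);
		return 0;
	}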