Re: TPM operation times out (very rarely)

On Fri, Feb 21, 2025 at 12:44:45PM +0000, Jonathan McDowell wrote:
> On Thu, Feb 20, 2025 at 09:42:28AM +0100, Michal Suchánek wrote:
> > On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> > > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > > Hello,
> > > > 
> > > > there is a problem report that, on a specific type of system, unlocking
> > > > an encrypted volume (using a PCR to release the key) fails about 0.1% of
> > > > the time because of a TPM operation timeout.
> > > > 
> > > > Minimizing the test case failed so far.
> > > > 
> > > > For example, booting into text mode as opposed to graphical desktop
> > > > makes the problem unreproducible.
> > > > 
> > > > The test is done with a frankenkernel whose TPM drivers are roughly on
> > > > par with Linux 6.4, but with an actual Linux 6.4 kernel the problem is
> > > > not reproducible either.
> > > > 
> > > > However, given the problem takes up to a day to reproduce I do not have
> > > > much confidence in the negative results.
> > > 
> > > Michal, can you possibly try the below and see if it helps out? There
> > > seems to be a timing bug introduced in 6.4+ that I think might be
> > > related, and matches up with some of our internal metrics that showed an
> > > increase in timeouts in 6.4 onwards.
> > 
> > Thanks for looking into this
> 
> No problem. It's something we've seen in our fleet and I've been trying
> to get to the bottom of, so having some additional data from someone
> else is really helpful.
> 
> > > commit 79041fba797d0fe907e227012767f56dd93fac32
> > > Author: Jonathan McDowell <noodles@xxxxxxxx>
> > > Date:   Wed Feb 19 16:20:44 2025 -0600
> > > 
> > >     tpm, tpm_tis: Fix timeout handling when waiting for TPM status
> > >     
> > >     The change to only use interrupts to handle supported status changes,
> > >     then switch to polling for the rest, inverted the status test and sleep
> > >     such that we can end up sleeping beyond our timeout and not actually
> > >     checking the status. This can result in spurious TPM timeouts,
> > >     especially on a more loaded system. Fix by switching the order back so
> > >     we sleep *then* check. We've done an up-front check when we enter the
> > >     function so this won't cause an additional delay when the status is
> > >     already what we're looking for.
> > >     
> > >     Cc: stable@xxxxxxxxxxxxxxx # v6.4+
> > >     Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
> > >     Signed-off-by: Jonathan McDowell <noodles@xxxxxxxx>
> > > 
> > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > > index fdef214b9f6b..167d71747666 100644
> > > --- a/drivers/char/tpm/tpm_tis_core.c
> > > +++ b/drivers/char/tpm/tpm_tis_core.c
> > > @@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
> > >  		return 0;
> > >  	/* process status changes without irq support */
> > >  	do {
> > > +		usleep_range(priv->timeout_min,
> > > +			     priv->timeout_max);
> > 
> > What would be the priv->timeout_min and priv->timeout_max here?
> > 
> > Note that there are timeouts that are 200ms, and they are overshot by 2s.
> > 
> > If the 200ms timeout relies on the sleep during the wait for the timeout
> > being much longer than the timeout itself then the timeout is arguably
> > bogus regardless of this change helping.
> 
> Ah, I thought your major issue was the 2s timeout that was only slightly
> exceeded.
> 
> However in my initial tracing I've seen wait_for_tpm_stat take much
> longer than the timeout that's passed in, which is what caused me to go
> and investigate this code path and note it had been changed in 6.4. It
> seems like a bug either way, but I've been at the TCG meeting this week
> and not had time to do further instrumentation and confirmation. Given
> you seem to have a more reliable reproducer I thought it might be easy
> enough for you to see if it made any difference.

The problem is no longer reproducible, probably due to some other change
in the test environment. So much for the reliable reproducer.

Yes, I think this is a bug either way and should be addressed, although
its effect on this problem is minor at best.

Thanks

Michal



