Re: TPM operation times out (very rarely)

Michal Suchánek <msuchanek@xxxxxxx> · Fri, 7 Feb 2025 10:40:16 +0100

On Fri, Feb 07, 2025 at 09:26:16AM +0000, Jonathan McDowell wrote:
> On Thu, Feb 06, 2025 at 10:35:32PM +0200, Jarkko Sakkinen wrote:
> > On Wed, Feb 05, 2025 at 02:29:36PM +0000, Jonathan McDowell wrote:
> > > Interesting. TPM2_CC_CONTEXT_LOAD (353) / TPM2_CC_FLUSH_CONTEXT (357) /
> > > TPM2_CC_CONTEXT_SAVE (354) I kinda expect to maybe take a bit longer,
> > > but TPM2_CC_GET_RANDOM (379) is a little surprising.
> > 
> > The whole arithmetic with timeout_a/b/c is mostly gibberish and could
> > be replaced with a single "max" constant without issues (just set it
> > large enough).
> > 
> > They could be all be replaced with let's say 3s timeout in a constant.
> 
> This appears to have come up before:
> 
> https://lore.kernel.org/linux-integrity/358e89ed2b766d51b5f57abf31ab7a925ac63379.1552348123.git.calvinowens@xxxxxx/
> 
> That patch was deemed overly complex and it was suggested to split it
> up; I can't find any indication that was ever done which I guess is why
> the discussion died off.

Also it comes before TPM_LONG_LONG, it sounds like that would still be
special.

> 
> > > > Failure is observed with another chip type as well:
> > > > 
> > > > localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> > > > 22)
> > > > 
> > > > TPM Device
> > > >         Vendor ID: IFX
> > > >         Specification Version: 2.0
> > > >         Firmware Revision: 7.83
> > > >         Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
> > > >         Characteristics:
> > > >                 Family configurable via firmware update
> > > >                 Family configurable via OEM proprietary mechanism
> > > >         OEM-specific Information: 0x00000000
> > > 
> > > That looks like an SLB9670, not running the latest firmware (7.85). I
> > > think that might have the errata I've been trying to work around; my
> > > current patch in testing (with logging to see how effective it is):
> > > 
> > > commit d8c680ec34e7f42f731e7d64605a670fb7b3b4d1
> > > Author: Jonathan McDowell <noodles@xxxxxxxx>
> > > Date:   Mon Aug 19 09:22:46 2024 -0700
> > > 
> > >     tpm: Workaround failed command reception on Infineon devices
> > >     
> > >     Some Infineon devices have a issue where the status register will get
> > >     stuck with a quick REQUEST_USE / COMMAND_READY sequence. The work around
> > >     is to retry the command submission. Add appropriate logic to do this in
> > >     the send path.
> > >     
> > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > > index fdef214b9f6b..561d6801e299 100644
> > > --- a/drivers/char/tpm/tpm_tis_core.c
> > > +++ b/drivers/char/tpm/tpm_tis_core.c
> > > @@ -464,7 +464,12 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> > >  
> > >  		if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
> > >  					&priv->int_queue, false) < 0) {
> > > -			rc = -ETIME;
> > > +			if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
> > > +				dev_err(&chip->dev, "Timed out waiting for status valid in send, retrying\n");
> > > +				rc = -EAGAIN;
> > 
> > 
> > I'm not sure why wait_for_tpm_stat() return value is ignored but it
> > should not be like that. E.g. it can return -ERESTARTSYS. Probably
> > would be better to check all the call sites for it that they do
> > same thing.
> > 
> > I.e. rc = wait_for_tpm_stat(...);
> > 
> > /* ... */
> 
> So just to clarify, this more recent patch is working around a situation
> where the status register gets stuck and needs a complete retry of the
> command send - it's an Infineon errata, not something that would be
> fixed with a longer timeout.
> 
> We see what looks like Michal's issue with timeouts as well, I just
> haven't made the step of instrumenting it all the way he has.

And I haven't seen the issue that needs re-sending the command so far.

Maybe it happens even less frequently than the excessive processing
time.

Fully restarting the syscall would fix both issues but manual restart of
the userspace task reportedly did not work so I have my doubts that
this method with returning from the syscall would be effective.

Thanks

Michal