Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Thu, 01 Oct 2020 11:32:59 -0700

On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
> On 10/1/20 12:53 AM, James Bottomley wrote:
> > On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> > > On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> > > > On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
[...]
> > > > > I also wonder if we could adjust the frequency dynamically.
> > > > > I.e. start with optimistic value and lower it until finding
> > > > > the sweet spot.
> > > >  
> > > > The problem is the way this crashes: the TPM seems to be
> > > > unrecoverable. If it were recoverable without a hard reset of
> > > > the entire machine, we could certainly play around with it.  I
> > > > can try alternative mechanisms to see if anything's viable, but
> > > > to all intents and purposes, it looks like my TPM simply stops
> > > > responding to the TIS interface.
> > >  
> > > A quickly scraped idea probably with some holes in it but I was
> > > thinking something like
> > > 
> > > 1. Initially set slow value for latency, this could be the
> > > original 15 ms.
> > > 2. Use this to read TPM_PT_VENDOR_STRING_*.
> > > 3. Lookup based vendor string from a fixup table a latency that
> > > works
> > >     (the fallback latency could be the existing latency).
> >  
> > Well, yes, that was sort of what I was thinking of doing for the
> > Atmel ... except I was thinking of using the TIS VID (16 bit
> > assigned vendor ID) which means we can get the information to set
> > the timeout before we have to do any TPM operations.
> 
> I wonder if the timeout issue exists for all TPM commands for the
> same manufacturer.  For example, does the ATMEL TPM also crash when 
> extending  PCRs ?
> 
> In addition to defining a per TPM vendor based lookup table for
> timeout, would it be a good idea to also define a Kconfig/boot param
> option to allow timeout setting.  This will enable to set the timeout
> based on the specific use.

I don't think we need go that far (yet).  The timing change has been in
upstream since:

commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
Author: Nayna Jain <nayna@xxxxxxxxxxxxxxxxxx>
Date:   Wed May 16 01:51:25 2018 -0400

    tpm: reduce polling time to usecs for even finer granularity

That commit shipped in kernel 4.18, over two years ago.  In all that
time we've discovered two problems: mine, which looks to be an artifact
of an experimental upgrade process in a new Nuvoton, and the Atmel.
That means pretty much every other TPM simply works with the existing
timings.

> I was also thinking how will we decide the lookup table values for
> each vendor ?

I wasn't thinking we would.  I was thinking I'd do a simple exception
for the Atmel and nothing else.  I don't think my Nuvoton is in any way
characteristic.  Indeed my pluggable TPM rainbow bridge system works
just fine with a Nuvoton and the current timings.

We can add additional exceptions if they actually turn up.
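
As a rough illustration of the shape such an exception could take, here
is an untested userspace sketch keyed on the TIS vendor ID.  It is not
the actual patch: every name in it is hypothetical, the Atmel VID of
0x1114 is my assumption about the assigned value, the VID is assumed to
sit in the low 16 bits of the DID_VID register, and the timing numbers
are placeholders rather than measured values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical polling window, in microseconds. */
struct tpm_poll_timing {
	unsigned int min_us;
	unsigned int max_us;
};

/* One exception entry: a vendor ID and the timing to use for it. */
struct tpm_vid_quirk {
	uint16_t vid;
	struct tpm_poll_timing timing;
};

/* Assumption: vendor ID assigned to Atmel. */
#define EXAMPLE_VID_ATMEL 0x1114

/* Placeholder default: the fast polling every other TPM copes with. */
static const struct tpm_poll_timing example_default_timing = { 100, 500 };

/* Exception table: only the Atmel for now, more if they turn up. */
static const struct tpm_vid_quirk example_quirks[] = {
	/* Placeholder slow window near the 15 ms discussed above. */
	{ EXAMPLE_VID_ATMEL, { 14700, 15000 } },
};

/* Assumption: the vendor ID is the low 16 bits of DID_VID. */
static uint16_t example_vid(uint32_t did_vid)
{
	return (uint16_t)(did_vid & 0xffff);
}

/* Return the quirk timing for this vendor, or the default if none. */
static struct tpm_poll_timing example_poll_timing(uint32_t did_vid)
{
	uint16_t vid = example_vid(did_vid);
	size_t i;

	for (i = 0; i < sizeof(example_quirks) / sizeof(example_quirks[0]); i++) {
		if (example_quirks[i].vid == vid)
			return example_quirks[i].timing;
	}
	return example_default_timing;
}
```

Because the lookup needs only a register read, the timing can be chosen
before any TPM command is sent, and adding another vendor is a one-line
table entry.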

James