Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries

On 10/1/20 12:53 AM, James Bottomley wrote:
On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
On Wed, Sep 30, 2020 at 01:48:15PM -0700, James Bottomley wrote:
On Wed, 2020-09-30 at 18:37 +0300, Jarkko Sakkinen wrote:
On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley
wrote:
On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley
wrote:
On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
[...]
However, there is another possibility: it's
something to do with the byte read; I notice you
don't require the same slowdown for the burst count
read, which actually reads the status register and
burst count as a read32.  If that really is the
case, would substituting a read32 for the atmel and
just throwing the upper bytes away in
tpm_tis_status() allow us to keep the current
timings?  I can actually try doing this and see if
it fixes my nuvoton.
It would be helpful if you could find a solution
without reducing performance. I think it is a
separate problem to address, though. Maybe it's not
worth mixing them in the same fix.
Well, if it works, no other fix is needed.

This is what I'm currently trying out on my nuvoton
with the timings reverted to being those in the vanilla
kernel.  So far it hasn't crashed, but I haven't run it
for long enough to be sure yet.

James
OK, so the bus does not like one byte reads but prefers
full (32-bit) word reads? I.e. what's the context?
It's not supported by anything in the spec, just empirical
observation.  However, the spec says the status register is
24 bits: the upper 16 being the burst count.  When we read
the whole status register, including the burst count, we do
a read32.  I observed that the elongated timing was only
added for the read8 code, not the read32, which supports the
theory that the former causes the Atmel to crash but the
latter doesn't.  Of course it's always possible that,
probabilistically, the Atmel is going to crash on the burst
count read, but that's exercised far less than the status-only
read.
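
For illustration, a minimal sketch of that read32 substitution in
tpm_tis_status(), assuming the helpers in drivers/char/tpm/tpm_tis_core.h;
this is only a sketch of the idea being discussed, not the patch James was
testing:

#include "tpm_tis_core.h"	/* tpm_tis_read32(), TPM_STS(), struct tpm_tis_data */

static u8 tpm_tis_status(struct tpm_chip *chip)
{
	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
	u32 value;
	int rc;

	/* Read the whole 24-bit status register (status byte plus
	 * burst count) with a single 32-bit access instead of a read8 ...
	 */
	rc = tpm_tis_read32(priv, TPM_STS(priv->locality), &value);
	if (rc < 0)
		return 0;

	/* ... and throw the upper bytes (the burst count) away. */
	return value & 0xff;
}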
That explanation is good enough for me. Can you
include it in the final commit as soon as we hear how your
fix works for Hao?
Sure.  I'm afraid I have to report that it didn't work for
me.  My Nuvoton is definitely annoyed by the frequency of the
prodding rather than the register width.
Sorry, this might have been stated at some point but what type of
bus is it connected with?
It's hard to tell: this is my Dell Laptop, but I'd have to bet LPC.

Does it help in any way to tune the frequency?
Not really: the driver just reads and writes a specific memory
mapped address and all the conversion to the LPC back end is done
by memory read/write operations.  The TPM itself has a clock but
doesn't give the TIS interface software control.
Some TPMs use tpm_tis_spi instead of MMIO.
Yes, but I'm fairly certain mine's not SPI.

I also wonder if we could adjust the frequency dynamically, i.e.
start with an optimistic value and lower it until we find the sweet
spot.
The problem is the way this crashes: the TPM seems to be
unrecoverable. If it were recoverable without a hard reset of the
entire machine, we could certainly play around with it.  I can try
alternative mechanisms to see if anything's viable, but to all
intents and purposes, it looks like my TPM simply stops responding
to the TIS interface.
A quickly scraped-together idea, probably with some holes in it, but
I was thinking of something like:

1. Initially set a slow value for the latency; this could be the
   original 15 ms.
2. Use this to read TPM_PT_VENDOR_STRING_*.
3. Look up from a fixup table, based on the vendor string, a latency
   that works (the fallback could be the existing latency).
Well, yes, that was sort of what I was thinking of doing for the Atmel
... except I was thinking of using the TIS VID (the 16-bit assigned
vendor ID), which means we can get the information to set the timeout
before we have to do any TPM operations.
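
A rough sketch of what such a table keyed on the TIS DID/VID register
might look like, again assuming the tpm_tis_core helpers; the VID value
for Atmel and the 15 ms delay are assumptions for illustration, not
measured numbers:

#include "tpm_tis_core.h"	/* tpm_tis_read32(), TPM_DID_VID(), struct tpm_tis_data */

#define TIS_DEFAULT_DELAY_US	15000	/* conservative fallback: the slow original timing */

struct tis_vendor_delay {
	u16 vid;		/* low 16 bits of the TPM_DID_VID register */
	unsigned int delay_us;	/* status-poll delay believed to work */
};

/* Illustrative entry only: 0x1114 is assumed to be Atmel's TCG vendor ID. */
static const struct tis_vendor_delay tis_vendor_delays[] = {
	{ 0x1114, 15000 },
};

static unsigned int tis_lookup_vendor_delay(struct tpm_tis_data *priv)
{
	u32 didvid;
	unsigned int i;

	/* DID/VID can be read before any TPM command has to be issued. */
	if (tpm_tis_read32(priv, TPM_DID_VID(0), &didvid))
		return TIS_DEFAULT_DELAY_US;

	for (i = 0; i < ARRAY_SIZE(tis_vendor_delays); i++)
		if ((didvid & 0xffff) == tis_vendor_delays[i].vid)
			return tis_vendor_delays[i].delay_us;

	return TIS_DEFAULT_DELAY_US;
}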


I wonder if the timeout issue exists for all TPM commands on TPMs from the same manufacturer.  For example, does the Atmel TPM also crash when extending PCRs?

In addition to defining a per-TPM-vendor lookup table for the timeout, would it be a good idea to also define a Kconfig/boot param option that allows setting the timeout?  This would make it possible to set the timeout based on the specific use case.
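
For example, a module/boot parameter along these lines could expose the
delay; the parameter name status_delay_us and its default here are
hypothetical, just to illustrate the shape of it:

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical tpm_tis override, settable on the kernel command line,
 * e.g. tpm_tis.status_delay_us=15000 for a very cautious 15 ms poll.
 */
static unsigned int status_delay_us = 500;	/* assumed default */
module_param(status_delay_us, uint, 0444);
MODULE_PARM_DESC(status_delay_us,
		 "Delay between TPM status polls in microseconds");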

I was also wondering how we will decide the lookup table values for each vendor.

Thanks & Regards,

      - Nayna



