On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> We're having lots of problems with TPM commands timing out, and we're
> seeing these problems across lots of different hardware (both v1/v2).
>
> I instrumented the driver to collect latency data, but I wasn't able
> to find any specific timeout to fix: it seems like many of them are
> too aggressive. So I tried replacing all the timeout logic with a
> single universal long timeout, and found that makes our TPMs 100%
> reliable.
>
> Given that this timeout logic is very complex, problematic, and
> appears to serve no real purpose, I propose simply deleting all of
> it.

"No real purpose" is a bit strong, given that all of these timeouts are
mandated by the standards. The purpose stated by the standards is that
there needs to be a way of distinguishing a TPM that has crashed from a
TPM that is simply taking a very long time to respond. For a normally
functioning TPM the logic looks complex and unnecessary, but for a
malfunctioning one it's a lifesaver.

Could you first check that this isn't a problem we introduced with our
polling changes? My Nuvoton still doesn't work properly with the
default poll timings, but it works flawlessly if I use the patch below.
I think my Nuvoton is a bit out of spec (it's a very early model that
was software-upgraded from 1.2 to 2.0), because no one else on the list
seems to see the problems I see, but perhaps you are seeing them too.

James

---
From 249d60a9fafa8638433e545b50dab6987346cb26 Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 11 Jul 2018 10:11:14 -0700
Subject: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression

tpm_tis regressed recently to the point where the TPM being driven by
it falls off the bus and cannot be contacted after some hours of use.
This is the failure trace:

jejb@jarvis:~> dmesg|grep tpm
[    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
[14566.626614] tpm tpm0: Operation Timed out
[14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
[14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
[14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
...

The problem is caused by a change that made us poke the TPM far more
often to see if it's ready. Apparently something about the bus it's on
and the TPM itself means that it crashes or falls off the bus if it is
poked too often, and once this happens only a reboot will recover it.

The fix I've come up with is to adjust the timings so the TPM no longer
falls off the bus. Obviously, this fix works for my Nuvoton NPCT6xxx,
but that's the only TPM I've tested it with.

Fixes: 424eaf910c32 ("tpm: reduce polling time to usecs for even finer granularity")
Signed-off-by: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
---
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 4b104245afed..a6c806d98950 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -64,8 +64,8 @@ enum tpm_timeout {
 	TPM_TIMEOUT_RETRY = 100, /* msecs */
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
-	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
-	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
+	TPM_TIMEOUT_USECS_MIN = 750,      /* usecs */
+	TPM_TIMEOUT_USECS_MAX = 1000,	/* usecs */
 };
 
 /* TPM addresses */