I am attaching the original bug report to this thread for new reviewers to get better context --- Hi TPM Driver Maintainers, We are from Rubrik engineering team. We found a TPM driver update since kernel 4.14 causing atmel TPM chips crash. We have root caused it and have a patch on the kernel used by Rubrik products. Given the “bug” has impact on not just Rubrik products, but all atmel TPM chips, we are asking to fix the issue in the kernel upstream. The commit that introduced the bug since 4.14 https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 Effected platform / system: - Cisco UCS C220 M5 with atmel TPM chip - Ubuntu 16.04 - Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4 - Cisco UCS C240 M5 with atmel TPM chip - Ubuntu 16.04 - Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4 ``` # TPM chip used in the problematic platform $ tpm_version TPM 1.2 Version Info: Chip Version: 1.2.66.1 Spec Level: 2 Errata Revision: 3 TPM Vendor ID: ATML TPM Version: 01010000 Manufacturer Info: 41544d4c ``` Not all Cisco server nodes are facing the crash, but there are a good amount of portion (around 50% from nodes in Rubrik) are persistently having the TPM crash issue. Our other platforms using TPM chips from other vendors do not have issue. These platform are running the same software as the problematic platforms run. Those TPM non-effected vendors are - IFX - STM - WEC TPM crash signature: ``` # error when running tpm tool $ tpm_sealdata -z Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error # tpm driver sends error $ sudo dmesg | grep tpm0 [59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62 [59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62 ``` Our Root Cause Analysis >From the error code “-62”, it looks similar to another bug https://patchwork.kernel.org/patch/10520247/ where the “TPM_TIMEOUT_USECS_MAX” and “TPM_TIMEOUT_USEC_MIN” is too small, which causes TPM get queried too frequently, and thus crashes. The issue for atmel TPM chip crash is similar the commit https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 changed TPM sleep logic from using `msleep` to `tpm_msleep`, in which `usleep_range` is used. As what https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 intended to do, using usleep_range can make the sleep period shorter, because msleep actually sleeps longer when the sleep perioid is within 20ms, and usleep_range can make it more precise. According to our experiments, - usleep_range makes the TPM sleep precisely for TPM_TIMEOUT (i.e. 5ms https://github.com/torvalds/linux/blob/v4.14/drivers/char/tpm/tpm.h#L52) - msleep(TPM_TIMEOUT) actually sleeps around 15ms Thus the TPM get queried more frequently than before, which is likely the root cause of the atmel chip crash. We fix it by bumping up the TPM_TIMEOUT to 15ms. Rubrik Patch (patch only for 4.14) ``` diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h index 72d3ce4..9b8f3f8 100644 --- a/drivers/char/tpm/tpm.h +++ b/drivers/char/tpm/tpm.h @@ -49,7 +49,15 @@ enum tpm_const { }; enum tpm_timeout { - TPM_TIMEOUT = 5, /* msecs */ + TPM_TIMEOUT = 15, /* msecs */ TPM_TIMEOUT_RETRY = 100, /* msecs */ TPM_TIMEOUT_RANGE_US = 300 /* usecs */ }; ``` With the patch, the atmel TPM chip crash is fixed. Proposal We want the kernel upstream to adopt the fix or have a better fix for the atmel chip while not bring performance regression for other TPM chips. We understand that https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 was intended to shorten the TPM respond time, but it does not work well for atmel TPM chips. Probably we should override TPM_TIMEOUT value for atmel chips at least. Thanks Hao > On Sep 27, 2020, at 5:11 PM, Hao Wu <hao.wu@xxxxxxxxxx> wrote: > > Hi James, > > Maybe there is a misunderstanding. Here I am using tpm_msleep, not msleep. > tpm_msleep is using usleep underlaying. See > https://github.com/torvalds/linux/blob/master/drivers/char/tpm/tpm.h#L188 > > The reasons I choose 15ms, is because before > https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 > (Where msleep is changed to tpm_msleep (which is essentially usleep)), > The actual sleep time is 15ms, thus here we change this back to 15ms to fix > regression. > > Thanks > Hao > >> On Sep 27, 2020, at 11:25 AM, James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote: >> >> On Sat, 2020-09-26 at 16:10 -0700, Hao Wu wrote: >>> Resending following email in plaintext. >>> >>> ---- >>> >>> Hi James, >>> >>> Thanks for following up. >>> >>> We have actually tried change >>> TPM_TIMEOUT_USECS_MIN / TPM_TIMEOUT_USECS_MAX >>> according to https://patchwork.kernel.org/patch/10520247/ >>> It does not solve the problem for ATMEL chip. The chips facing crash >>> is >>> not experimental, but happens commonly in >>> the production systems we and our customers are using. >>> It is widely found in Cisco 220 / 240 systems which are using >>> Ateml chips. >> >> Well, I came up with the values in that patch by trial and error .... >> all I know is they work for my nuvoton. If they're not right for you, >> see if you can find what values actually do work for your TPM. The >> difference between msleep and usleep_range is that the former can have >> an indefinitely long timeout and the latter has a range bounded one. >> If you think msleep works for you, the chances are it doesn't and >> you're relying on the large upper bound to make the bug infrequent >> enough for you not to see it. Playing with the values in usleep range >> will help you find what the actual timeout is and eliminate the problem >> for good. >> >> James >