Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries

> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@xxxxxxxxxx> wrote:
>>> 
>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>> [...]
>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>> 
>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>> 
>>>>>>>>>> A quickly sketched idea, probably with some holes in it, but I
>>>>>>>>>> was thinking of something like:
>>>>>>>>>> 
>>>>>>>>>> 1. Initially set a slow value for the latency; this could be the
>>>>>>>>>> original 15 ms.
>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>> 3. Based on the vendor string, look up a latency that works from
>>>>>>>>>> a fixup table (the fallback latency could be the existing
>>>>>>>>>> latency), roughly as sketched below.
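
A rough sketch of that probe sequence in C (everything below is
illustrative: tpm_probe_stat_sleep_us() and tpm_fixup_latency_us() are
invented names, and the property constant is taken from the TPM 2.0
spec rather than an existing kernel define):

	#define TPM2_PT_VENDOR_STRING_1	0x106	/* TPM 2.0 spec: PT_FIXED + 6 */

	static unsigned int tpm_probe_stat_sleep_us(struct tpm_chip *chip)
	{
		/* 1. Start with the slow, known-safe polling interval. */
		unsigned int sleep_us = 15 * USEC_PER_MSEC;
		char vendor[5] = { 0 };
		u32 val;

		/* 2. Read the vendor string while still polling slowly. */
		if (tpm2_get_tpm_pt(chip, TPM2_PT_VENDOR_STRING_1, &val, NULL))
			return sleep_us;	/* no answer: keep the safe value */
		memcpy(vendor, &val, 4);

		/* 3. Look the vendor up in a fixup table; the fallback is
		 * the slow latency we started with. */
		return tpm_fixup_latency_us(vendor, sleep_us);
	}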
>>>>>>>>> 
>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (the 16-bit
>>>>>>>>> assigned vendor ID), which means we can get the information to set
>>>>>>>>> the timeout before we have to do any TPM operations.
>>>>>>>> 
>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>>>> same manufacturer.  For example, does the Atmel TPM also crash when
>>>>>>>> extending PCRs?
>>>>>>>> 
>>>>>>>> In addition to defining a per-vendor lookup table for timeouts,
>>>>>>>> would it be a good idea to also define a Kconfig/boot-param option
>>>>>>>> to allow setting the timeout?  That would make it possible to set
>>>>>>>> the timeout for a specific use case, along the lines sketched below.
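
If we wanted such an override, it could be as small as a module
parameter (a sketch only; the parameter name and its default are
invented for illustration):

	/* e.g. tpm_tis.wait_stat_timeout_us=15000 on the kernel command line */
	static unsigned int wait_stat_timeout_us = 500;
	module_param(wait_stat_timeout_us, uint, 0444);
	MODULE_PARM_DESC(wait_stat_timeout_us,
			 "Polling sleep in usecs used by wait_for_tpm_stat()");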
>>>>>>> 
>>>>>>> I don't think we need to go that far (yet).  The timing change has been in
>>>>>>> upstream since:
>>>>>>> 
>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>> Author: Nayna Jain <nayna@xxxxxxxxxxxxxxxxxx>
>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>> 
>>>>>>> tpm: reduce polling time to usecs for even finer granularity
>>>>>>> 
>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>>>> time we've discovered two problems: mine, which looks to be an artifact
>>>>>>> of an experimental upgrade process in a new Nuvoton, and the Atmel.
>>>>>>> That means pretty much every other TPM simply works with the existing
>>>>>>> timings.
>>>>>>> 
>>>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>>>> each vendor ?
>>>>>>> 
>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>>>> just fine with a Nuvoton and the current timings.
>>>>>>> 
>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>> 
>>>>>> I'd add a table and fallback.
>>>>>> 
>>>>> 
>>>>> Hi folks,
>>>>> 
>>>>> I want to follow up on this a bit and check whether we have reached a
>>>>> consensus on how to fix the timeout issue for the Atmel chip.
>>>>> 
>>>>> Should we revert the changes or introduce the lookup table for chips?
>>>>> 
>>>>> Is there anything I can help with from the Rubrik side?
>>>>> 
>>>>> Thanks
>>>>> Hao
>>>> 
>>>> There is nothing to revert, as the previous patch was not applied, but
>>>> I'm of course ready to review any new attempts.
>>>> 
>>> 
>>> Hi Jarkko,
>>> 
>>> By “revert” I meant reverting the timeout value changes by applying the
>>> patch I proposed, since the timeout value under discussion does cause issues.
>>> 
>>> Why don’t we apply the patch and then improve the performance in a way
>>> that does not break TPMs?
>>> 
>>> Hao
>> 
>> Hi Jarkko and folks,
>> 
>> It’s been a while since our last discussion. I want to push a fix upstream
>> for the Atmel chip. It looks like we currently have the following choices:
>> 1. Generic fix for all vendors: have a lookup table for the sleep time of
>>  wait_for_tpm_stat (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
>> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat
>>  back to 15 ms. This is the currently proposed patch.
>> 3. Fix the regression by making an exception for the Atmel chip.
>> 
>> Should we reach consensus on which one we want to pursue before digging
>> into the implementation of the patch? In my opinion, I would prefer to fix
>> the regression with option 2, and then pursue option 1 as the long-term
>> solution; option 3 is hacky.
> 
> What does option 1 fix for *all* vendors?
> 
>> Let me know what you guys think.
>> 
>> Hao
> 
> /Jarkko

Hi Jarkko and folks,

It has been a while again. In my previous message I answered Jarkko’s question about option 1.
Jarkko, let me know if it is clear to you, or if you have further questions or suggestions on what to do next.
Somehow I couldn’t find the last message I sent, but it is at
https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@xxxxxxxxxx/

At a high level, option 1 is to add a timing lookup table keyed by manufacturer, so we can
configure the timing for each chip individually. Then we don’t need to worry that fixing the
Atmel timing may cause performance degradation for other chips.
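
A minimal sketch of what I have in mind (the struct, the helper name, and
the single table entry are illustrative, not a finished patch; 0x1114 is
the Atmel TIS vendor ID, and TPM_TIMEOUT_USECS_MAX is the current fast
upper bound from tpm.h):

	struct tpm_timing_fixup {
		u16 vid;		/* lower 16 bits of TPM_DID_VID */
		unsigned int sleep_us;	/* sleep used by wait_for_tpm_stat() */
	};

	static const struct tpm_timing_fixup tpm_timing_fixups[] = {
		{ 0x1114 /* Atmel */, 15 * USEC_PER_MSEC },	/* pre-4.18 value */
	};

	static unsigned int tpm_stat_sleep_us(u16 vid)
	{
		size_t i;

		for (i = 0; i < ARRAY_SIZE(tpm_timing_fixups); i++)
			if (tpm_timing_fixups[i].vid == vid)
				return tpm_timing_fixups[i].sleep_us;

		/* Vendors not listed keep the current fast polling. */
		return TPM_TIMEOUT_USECS_MAX;
	}

That way the Atmel exception cannot regress any other vendor’s timing.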

I do want to push the fix into the TPM driver itself, since this issue is likely to be
hit again when people refactor without testing chips from all manufacturers.

Let me know how I should push this forward.

Thanks
Hao




