Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries

Hao Wu <hao.wu@xxxxxxxxxx> · Sun, 9 May 2021 20:15:32 -0700

> On May 9, 2021, at 7:17 PM, Mimi Zohar <zohar@xxxxxxxxxxxxx> wrote:
> 
> [Cc'ing Jarkko Sakkinen <jarkko@xxxxxxxxxx>]
> 
> On Sat, 2021-05-08 at 23:31 -0700, Hao Wu wrote:
>>> On May 8, 2021, at 11:18 PM, Hao Wu <hao.wu@xxxxxxxxxx> wrote:
>>> 
>>>> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>>>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>> [...]
>>>>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>>>>>>>> thinking something like
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>>>>>>>> original 15 ms.
>>>>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>>>>>>>> works
>>>>>>>>>>>>> (the fallback latency could be the existing latency).
>>>>>>>>>>>> 
>>>>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>>>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>>>>>>>> the timeout before we have to do any TPM operations.
>>>>>>>>>>> 
>>>>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>>>>>>>> extending  PCRs ?
>>>>>>>>>>> 
>>>>>>>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>>>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>>>>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>>>>>>>> based on the specific use.
>>>>>>>>>> 
>>>>>>>>>> I don't think we need go that far (yet).  The timing change has been in
>>>>>>>>>> upstream since:
>>>>>>>>>> 
>>>>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>>>>> Author: Nayna Jain <nayna@xxxxxxxxxxxxxxxxxx>
>>>>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>>>>> 
>>>>>>>>>> tpm: reduce polling time to usecs for even finer granularity
>>>>>>>>>> 
>>>>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>>>>>>> time we've discovered two problems: mine which looks to be an artifact
>>>>>>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>>>>>>>> That means pretty much every other TPM simply works with the existing
>>>>>>>>>> timings
>>>>>>>>>> 
>>>>>>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>>>>>>> each vendor ?
>>>>>>>>>> 
>>>>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>>>>>>> just fine with a Nuvoton and the current timings.
>>>>>>>>>> 
>>>>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>>>>> 
>>>>>>>>> I'd add a table and fallback.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi folks,
>>>>>>>> 
>>>>>>>> I want to follow up this a bit and check whether we reached a consensus 
>>>>>>>> on how to fix the timeout issue for Atmel chip.
>>>>>>>> 
>>>>>>>> Should we revert the changes or introduce the lookup table for chips.
>>>>>>>> 
>>>>>>>> Is there anything I can help from Rubrik side.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Hao
>>>>>>> 
>>>>>>> There is nothing to revert as the previous was not applied but I'm
>>>>>>> of course ready to review any new attempts.
>>>>>>> 
>>>>>> 
>>>>>> Hi Jarkko,
>>>>>> 
>>>>>> By “revert” I meant we revert the timeout value changes by applying
>>>>>> the patch I proposed, as the timeout value discussed does cause issues.
>>>>>> 
>>>>>> Why don’t we apply the patch and improve the perf in the way of not
>>>>>> breaking TPMs ? 
>>>>>> 
>>>>>> Hao
>>>>> 
>>>>> Hi Jarkko and folks,
>>>>> 
>>>>> It’s being a while since our last discussion. I want to push a fix in the upstream for ateml chip. 
>>>>> It looks like we currently have following choices:
>>>>> 1.  generic fix for all vendors: have a lookup table for sleep time of wait_for_tpm_stat 
>>>>> (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch) 
>>>>> 2.  quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15ms.
>>>>> It is the current proposed patch
>>>>> 3. Fix regression by making exception for ateml chip.  
>>>>> 
>>>>> Should we reach consensus on which one we want to pursue before dig
>>>>> into implementation of the patch? In my opinion, I prefer to fix the
>>>>> regression with 2, and then pursue 1 as long-term solution. 3 is
>>>>> hacky.
>>>> 
>>>> What does option 1 fix for *all* vendors?
>>>> 
>>>>> Let me know what do you guys think
>>>>> 
>>>>> Hao
>>>> 
>>>> /Jarkko
>>> 
>>> Hi Jarkko and folks,
>>> 
>>> It has been a while again. In my previous message I answered Jarkko’s question about the option 1.
>>> Jarkko, let me know if it is clear to you or you have further questions and suggestions on next to do.
>>> Somehow I couldn’t found the last message I sent but it is in 
>>> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@xxxxxxxxxx/
>>> 
>>> In high-level, the option 1 is to add a timing lookup table for each manufacture, hence we can
>>> configure timing for each chip respectively. Then we don’t need to worry about fixing ATMEL
>>> timing may cause performance degradation for other chips.
>>> 
>>> I do want to push the fix in TPM driver, which is likely to be hit going forward again when people are doing
>>> refactoring without testing chips from all manufacturing.
>>> 
>>> Let me know how should I push this forward.
>>> 
>>> Thanks
>>> Hao
>>> 
>> It looks like Jarkko’s email address (jarkko.sakkinen@xxxxxxxxxxxxxxx) is unreachable now,
>> can other TPM maintainer / reviewer help make a call and unblock this ? 
> 
> A while ago Jarkko asked everyone to use his kernel.org address.
> 
> Mimi

Ah thanks Mimi, just found Jarkko’s address.

Jarkko please check the message above when you have a chance.

Hao