> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@xxxxxxxxxx> wrote:
>>>
>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>> [...]
>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically,
>>>>>>>>>>>> i.e. start with an optimistic value and lower it until we find
>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>>
>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>> the entire machine, we could certainly play around with it. I
>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>>
>>>>>>>>>> A quickly sketched idea, probably with some holes in it, but I
>>>>>>>>>> was thinking something like:
>>>>>>>>>>
>>>>>>>>>> 1. Initially set a slow value for the latency; this could be the
>>>>>>>>>>    original 15 ms.
>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>> 3. Look up, based on the vendor string, a latency that works
>>>>>>>>>>    from a fixup table (the fallback could be the existing
>>>>>>>>>>    latency).
>>>>>>>>>
>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (the 16-bit
>>>>>>>>> assigned vendor ID), which means we can get the information to
>>>>>>>>> set the timeout before we have to do any TPM operations.
>>>>>>>>
>>>>>>>> I wonder if the timeout issue exists for all TPM commands from the
>>>>>>>> same manufacturer. For example, does the Atmel TPM also crash when
>>>>>>>> extending PCRs?
>>>>>>>>
>>>>>>>> In addition to defining a per-vendor lookup table for the timeout,
>>>>>>>> would it be a good idea to also define a Kconfig/boot-param option
>>>>>>>> to allow setting the timeout? This would enable setting the
>>>>>>>> timeout based on the specific use.
>>>>>>>
>>>>>>> I don't think we need to go that far (yet). The timing change has
>>>>>>> been in upstream since:
>>>>>>>
>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>> Author: Nayna Jain <nayna@xxxxxxxxxxxxxxxxxx>
>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>>
>>>>>>>     tpm: reduce polling time to usecs for even finer granularity
>>>>>>>
>>>>>>> which was in the released kernel 4.18: over two years ago. In all
>>>>>>> that time we've discovered two problems: mine, which looks to be an
>>>>>>> artifact of an experimental upgrade process in a new Nuvoton, and
>>>>>>> the Atmel. That means pretty much every other TPM simply works with
>>>>>>> the existing timings.
>>>>>>>
>>>>>>>> I was also thinking: how will we decide the lookup table values
>>>>>>>> for each vendor?
>>>>>>>
>>>>>>> I wasn't thinking we would. I was thinking I'd add a simple
>>>>>>> exception for the Atmel and nothing else. I don't think my Nuvoton
>>>>>>> is in any way characteristic. Indeed, my pluggable TPM rainbow
>>>>>>> bridge system works just fine with a Nuvoton and the current
>>>>>>> timings.
>>>>>>>
>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>>
>>>>>> I'd add a table and fallback.
>>>>>>
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> I want to follow up on this a bit and check whether we have reached
>>>>> a consensus on how to fix the timeout issue for the Atmel chip.
>>>>>
>>>>> Should we revert the changes or introduce the lookup table for chips?
>>>>>
>>>>> Is there anything I can help with from the Rubrik side?
>>>>>
>>>>> Thanks
>>>>> Hao
>>>>
>>>> There is nothing to revert, as the previous patch was not applied,
>>>> but I'm of course ready to review any new attempts.
>>>>
>>>
>>> Hi Jarkko,
>>>
>>> By “revert” I meant reverting the timeout value changes by applying
>>> the patch I proposed, as the timeout value discussed does cause
>>> issues.
>>>
>>> Why don’t we apply the patch and improve the performance in a way
>>> that does not break TPMs?
>>>
>>> Hao
>>
>> Hi Jarkko and folks,
>>
>> It’s been a while since our last discussion. I want to push a fix
>> upstream for the Atmel chip. It looks like we currently have the
>> following choices:
>>
>> 1. A generic fix for all vendors: have a lookup table for the sleep
>>    time of wait_for_tpm_stat (i.e. TPM_TIMEOUT_WAIT_STAT in my
>>    proposed patch).
>> 2. A quick fix for the regression: change the sleep time of
>>    wait_for_tpm_stat back to 15 ms. This is the currently proposed
>>    patch.
>> 3. Fix the regression by making an exception for the Atmel chip.
>>
>> Should we reach consensus on which one we want to pursue before
>> digging into the implementation of the patch? In my opinion, I would
>> prefer to fix the regression with 2, and then pursue 1 as the
>> long-term solution. 3 is hacky.
>
> What does option 1 fix for *all* vendors?
>
>> Let me know what you guys think.
>>
>> Hao
>
> /Jarkko

I meant that the timing is potentially sensitive per vendor; option 1 is
a solution for all vendors if this issue happens again. Option 1 is just
what you mentioned in the earlier discussion:

>>>>>> I'd add a table and fallback.

Hao