Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries


 



> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>> [...]
>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>> the sweet spot.
>>>>>>>> 
>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>> responding to the TIS interface.
>>>>>>> 
>>>>>>> A quickly sketched idea, probably with some holes in it, but I
>>>>>>> was thinking something like:
>>>>>>> 
>>>>>>> 1. Initially set a slow value for the latency; this could be the
>>>>>>>    original 15 ms.
>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>> 3. Based on the vendor string, look up a latency that works in a
>>>>>>>    fixup table (the fallback could be the existing latency).
>>>>>> 
>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>> Atmel ... except I was thinking of using the TIS VID (16-bit
>>>>>> assigned vendor ID), which means we can get the information to set
>>>>>> the timeout before we have to do any TPM operations.
>>>>> 
>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>> same manufacturer.  For example, does the Atmel TPM also crash when
>>>>> extending PCRs?
>>>>> 
>>>>> In addition to defining a per-vendor lookup table for the timeout,
>>>>> would it be a good idea to also define a Kconfig/boot-param option
>>>>> to allow setting the timeout?  This would enable tuning the timeout
>>>>> for specific uses.
>>>> 
>>>> I don't think we need to go that far (yet).  The timing change has been in
>>>> upstream since:
>>>> 
>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>> Author: Nayna Jain <nayna@xxxxxxxxxxxxxxxxxx>
>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>> 
>>>>   tpm: reduce polling time to usecs for even finer granularity
>>>> 
>>>> Which was in the released kernel 4.18, over two years ago.  In all that
>>>> time we've discovered two problems: mine, which looks to be an artifact
>>>> of an experimental upgrade process on a new Nuvoton, and the Atmel.
>>>> That means pretty much every other TPM simply works with the existing
>>>> timings.
>>>> 
>>>>> I was also wondering how we would decide the lookup table values
>>>>> for each vendor?
>>>> 
>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>> just fine with a Nuvoton and the current timings.
>>>> 
>>>> We can add additional exceptions if they actually turn up.
>>> 
>>> I'd add a table and fallback.
>>> 
>> 
>> Hi folks,
>> 
>> I want to follow up on this a bit and check whether we have reached a
>> consensus on how to fix the timeout issue for the Atmel chip.
>> 
>> Should we revert the changes, or introduce the lookup table for chips?
>> 
>> Is there anything I can help with from the Rubrik side?
>> 
>> Thanks
>> Hao
> 
> There is nothing to revert, as the previous patch was not applied, but
> I'm of course ready to review any new attempts.
> 

Hi Jarkko,

By “revert” I meant that we revert the timeout value changes by applying
the patch I proposed, since the timeout values discussed do cause issues.

Why don't we apply the patch and then improve the performance in a way
that does not break TPMs?

Hao
 




