> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@xxxxxxxxxx> wrote:
>>>
>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@xxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>> [...]
>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically,
>>>>>>>>>>>> i.e. start with an optimistic value and lower it until we find
>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>>
>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>> the entire machine, we could certainly play around with it. I
>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>>
>>>>>>>>>> A quickly sketched idea, probably with some holes in it, but I
>>>>>>>>>> was thinking something like:
>>>>>>>>>>
>>>>>>>>>> 1. Initially set a slow value for the latency; this could be the
>>>>>>>>>>    original 15 ms.
>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>> 3. Look up, based on the vendor string, a latency that works
>>>>>>>>>>    from a fixup table (the fallback could be the existing
>>>>>>>>>>    latency).
>>>>>>>>>
>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (the 16-bit
>>>>>>>>> assigned vendor ID), which means we can get the information to
>>>>>>>>> set the timeout before we have to do any TPM operations.
>>>>>>>>
>>>>>>>> I wonder if the timeout issue exists for all TPM commands from the
>>>>>>>> same manufacturer. For example, does the Atmel TPM also crash when
>>>>>>>> extending PCRs?
>>>>>>>>
>>>>>>>> In addition to defining a per-vendor lookup table for the timeout,
>>>>>>>> would it be a good idea to also define a Kconfig/boot-param option
>>>>>>>> to allow setting the timeout? This would enable setting the
>>>>>>>> timeout based on the specific use.
>>>>>>>
>>>>>>> I don't think we need to go that far (yet). The timing change has
>>>>>>> been in upstream since:
>>>>>>>
>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>> Author: Nayna Jain <nayna@xxxxxxxxxxxxxxxxxx>
>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>>
>>>>>>>     tpm: reduce polling time to usecs for even finer granularity
>>>>>>>
>>>>>>> which was in the released kernel 4.18: over two years ago. In all
>>>>>>> that time we've discovered two problems: mine, which looks to be an
>>>>>>> artifact of an experimental upgrade process in a new Nuvoton, and
>>>>>>> the Atmel. That means pretty much every other TPM simply works with
>>>>>>> the existing timings.
>>>>>>>
>>>>>>>> I was also thinking: how will we decide the lookup table values
>>>>>>>> for each vendor?
>>>>>>>
>>>>>>> I wasn't thinking we would. I was thinking I'd add a simple
>>>>>>> exception for the Atmel and nothing else. I don't think my Nuvoton
>>>>>>> is in any way characteristic. Indeed, my pluggable TPM rainbow
>>>>>>> bridge system works just fine with a Nuvoton and the current
>>>>>>> timings.
>>>>>>>
>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>>
>>>>>> I'd add a table and fallback.
>>>>>>
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> I want to follow up on this a bit and check whether we have reached
>>>>> a consensus on how to fix the timeout issue for the Atmel chip.
>>>>>
>>>>> Should we revert the changes or introduce the lookup table for chips?
>>>>>
>>>>> Is there anything I can help with from the Rubrik side?
>>>>>
>>>>> Thanks
>>>>> Hao
>>>>
>>>> There is nothing to revert, as the previous patch was not applied,
>>>> but I'm of course ready to review any new attempts.
>>>>
>>>
>>> Hi Jarkko,
>>>
>>> By “revert” I meant reverting the timeout value changes by applying
>>> the patch I proposed, as the timeout value discussed does cause
>>> issues.
>>>
>>> Why don’t we apply the patch and improve the performance in a way
>>> that does not break TPMs?
>>>
>>> Hao
>>
>> Hi Jarkko and folks,
>>
>> It’s been a while since our last discussion. I want to push a fix
>> upstream for the Atmel chip. It looks like we currently have the
>> following choices:
>>
>> 1. A generic fix for all vendors: have a lookup table for the sleep
>>    time of wait_for_tpm_stat (i.e. TPM_TIMEOUT_WAIT_STAT in my
>>    proposed patch).
>> 2. A quick fix for the regression: change the sleep time of
>>    wait_for_tpm_stat back to 15 ms. This is the currently proposed
>>    patch.
>> 3. Fix the regression by making an exception for the Atmel chip.
>>
>> Should we reach consensus on which one we want to pursue before
>> digging into the implementation of the patch? In my opinion, I would
>> prefer to fix the regression with 2, and then pursue 1 as the
>> long-term solution. 3 is hacky.
>
> What does option 1 fix for *all* vendors?
>
>> Let me know what you guys think.
>>
>> Hao
>
> /Jarkko

I meant that the timing is potentially sensitive per vendor; option 1 is
a solution for all vendors if this issue happens again. Option 1 is just
what you mentioned in the earlier discussion:

>>>>>> I'd add a table and fallback.

Hao