Hi Hamza, Paul, and Ken, I just took a look at the TPM driver code and specifically tpm_infineon.c, but it looks like it is not that straightforward to override TPM_TIMEOUT by TPM chip today. I may not have good expertise to make that done shortly. Could we checkin the global fix first to address the regression? Or someone with more TPM driver expertise can help work on the TPM-chip specific fix? Thanks Hao > On Sep 12, 2020, at 9:51 PM, Hao Wu <hao.wu@xxxxxxxxxx> wrote: > > Sorry again and resend following email in plaintext > ------- > > Hi Hamza and everyone, > > Thanks Hamza for the response on this issue. > >> The first aim of the patch you mentioned is not to shorten TPM timings, as no >> timing is changed at all on this patch, but to count time more accurately. > It is true that the code was not intended to do so explicitly, however the fact is that counting time more accurately in this context is shortening the TPM timings. > >> If your TPM needs more time to perform a specific operation, it has nothing to >> do with this patch, but rather with its specified timings in the kernel. The >> timings for your Atmel TPM were certainly wrong before 4.14 to begin with, they >> were not counted accurately and your TPM working was potentially a "positive" >> side effect of the imprecise time counting (ie non-specified delays were added >> every time msleep was used instead of usleep_range). >> TPM polling should not be affected because we use a different way to count >> timings in the kernel. > As our experiments show, the msleep is not inaccurate specifically to atmel TPM, but inaccurate in general in the kernel (or linux). Before the patch, it is likely all TPM timings are around 15ms instead of 5ms. (Which is why https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 says “we see results going from 1m57s unpatched to 40s") > >> If you see issues with your Atmel TPM, I'd suggest to make the changes in the >> tpm_atmel code only (tpm_infineon would be a suitable example to look at) and >> discuss the timing values with Atmel TPMs experts, but reverting this patch and >> affecting all the other TPMs seems like a misunderstanding. > I understand that the patch https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 makes the timing more accurate, that surfaced the issue. I am not sure how the original 5ms timeout came up at very beginning. If there is already a TPM standard with the 5ms, I agree that TPM hardware manufacturers should follow that. If the 5ms in the driver code was defined based on the tests on which hardwares, then it is likely that the inaccurate msleep mislead the original author. But in either way, we probably need a fix in driver code instead of hardware fix or firmware fix to resolve the breakage. > > It is not only observed in the Atmel TPM in Rubrik’s inventory, but also has been observed in multiple our customers’ machines. Thus I think it has wide impact on Atmel TPM chips, and code fix is required. > > From this issue, it looks like the timeout value could be different across TPM chip manufacturers, probably the QA on such change is better to be required for all TPM chip manufacturers in the future ? It looks like https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 has only tested IFX and STM, which potentially is the root cause of the issue. I am not familiar with the QA process of linux kernel, so correct me if I am wrong. > > Regarding of the final fix in the master, I agree that Atmel-specific fix is probably better than global fix. I can take a further look in this path, if folks all agree on that. Let me know your thoughts. > > Thanks > Hao > >> On Sep 12, 2020, at 7:17 AM, hamza@xxxxxxx wrote: >> >> Hello everyone, >> >> Dear Hao, >> >> The first aim of the patch you mentioned is not to shorten TPM timings, as no >> timing is changed at all on this patch, but to count time more accurately. >> >> If your TPM needs more time to perform a specific operation, it has nothing to >> do with this patch, but rather with its specified timings in the kernel. The >> timings for your Atmel TPM were certainly wrong before 4.14 to begin with, they >> were not counted accurately and your TPM working was potentially a "positive" >> side effect of the imprecise time counting (ie non-specified delays were added >> every time msleep was used instead of usleep_range). >> TPM polling should not be affected because we use a different way to count >> timings in the kernel. >> >> If you see issues with your Atmel TPM, I'd suggest to make the changes in the >> tpm_atmel code only (tpm_infineon would be a suitable example to look at) and >> discuss the timing values with Atmel TPMs experts, but reverting this patch and >> affecting all the other TPMs seems like a misunderstanding. >> >> Thanks, >> Hamza ATTAK. >> >> On Sat, Sep 12, 2020 at 10:14, Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote: >>> Dear Hao, >>> Thank you for the reply. >>> Am 12.09.20 um 10:10 schrieb Hao Wu: >>>> Thanks for quick responses over this report. >>> Thank you for the quick follow-up. >>>>> Hao, I wouldn´t expect a longer timeout causing the TPM to be >>>>> queried less frequently, but I do not know the code well. >>>> From our understanding, the TPM queries might be retried due to some >>>> reason, increase timeout 3x would lower the query frequency to 1/3. >>>> The explanation might be wrong, but the fact looks like the timeout >>>> matters. Unfortunately, engineers from Rubrik are not experts over >>>> the TPM driver code neither :( >>>>>> Be careful about making this a global change. It could reduce >>>>>> the TPM performance by 3x. We don't want to affect all TPMs to >>>>>> fix a bug in an old TPM 1.2 chip from one vendor. >>>>> Linux has a no regression policy, so the performance penalty >>>>> wouldn´t matter. Unfortunately, the regression was only noticed >>>>> several years after being introduced in Linux v4.14-rc2. >>>> So does that mean we are good to apply the global change ? Or we need >>>> to discuss about the actual fix further? >>> To get a fix into the stable series, it first needs to be applied to the master branch. I guess you tested also with Linux master, right? >>> Please add the explanation from your email to Greg into the git commit message, format the patch with `git format-patch -1 -o outgoing` and send it with `git send-email outgoing/*` to the addresses listed for the subsystem in `MAINTAINERS` and the people listed in the commit introducing the regression. >>> Then it can be properly reviewed and discussed. >>> Kind regards, >>> Paul >> >> >