On Tue, 2019-03-12 at 20:08 +0000, Calvin Owens wrote:
> On Tuesday 03/12 at 13:04 -0400, Mimi Zohar wrote:
> > On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > > We're having lots of problems with TPM commands timing out, and
> > > we're seeing these problems across lots of different hardware
> > > (both v1/v2).
> > > 
> > > I instrumented the driver to collect latency data, but I wasn't
> > > able to find any specific timeout to fix: it seems like many of
> > > them are too aggressive.  So I tried replacing all the timeout
> > > logic with a single universal long timeout, and found that makes
> > > our TPMs 100% reliable.
> > > 
> > > Given that this timeout logic is very complex, problematic, and
> > > appears to serve no real purpose, I propose simply deleting all
> > > of it.
> > 
> > Normally, before sending such a massive change as this, there would
> > be some indication, in the bug report or patch description, as to
> > which kernel introduced a regression.  Has this always been a
> > problem?  Is this something new?  How new?
> 
> Honestly we've always had problems with flakiness from these devices,
> but it seems to have regressed sometime between 4.11 and 4.16.

Well, that's a start.  Around 4.10 is when we started noticing TPM
performance issues due to the change in the kernel timer scheduling.
This resulted in commit a233a0289cf9 ("tpm: msleep() delays - replace
with usleep_range() in i2c nuvoton driver"), which was upstreamed in
4.12.

At the other end, James was referring to commit 424eaf910c32 ("tpm:
reduce polling time to usecs for even finer granularity"), which was
introduced in 4.18.

> I wish I had a better answer for you: we need on the order of a
> hundred machines to see the difference, and setting up these 100+
> machine tests is unfortunately involved enough that e.g. bisecting it
> just isn't feasible :/
> 
> What I can say for sure is that this patch makes everything much
> better for us.  If there's anything in particular you'd like me to
> test, I have an army of machines I'm happy to put to use, let me
> know :)

I would assume not all of your machines are the same, nor do they all
have the same TPM.  Could you verify that this problem is across the
board, not limited to a particular TPM?  BTW, are you seeing this
problem with both TPM 1.2 and 2.0?

thanks!

Mimi
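
P.S.  For anyone unfamiliar with the msleep()/usleep_range() difference
referenced above, the conversion follows roughly the pattern sketched
below.  This is only an illustrative fragment (the constants and the
helper name are made up), not the actual diff from commit a233a0289cf9:

#include <linux/delay.h>

/* Hypothetical polling interval for illustration only. */
#define EXAMPLE_TPM_POLL_MIN_US	300
#define EXAMPLE_TPM_POLL_MAX_US	500

static void example_tpm_poll_delay(void)
{
	/*
	 * msleep(1) can overshoot badly for short delays: it rounds up
	 * to whole jiffies and goes through the regular timer wheel, so
	 * the actual sleep is often several milliseconds.
	 * usleep_range() is hrtimer-based and takes an explicit min/max
	 * window instead, keeping short TPM polling delays short while
	 * still letting the scheduler coalesce wakeups.
	 */
	usleep_range(EXAMPLE_TPM_POLL_MIN_US, EXAMPLE_TPM_POLL_MAX_US);
}

Giving usleep_range() a window rather than a single value is the
recommended form: the upper bound gives the scheduler room to batch
wakeups without stretching a sub-millisecond poll into a multi-jiffy
sleep the way msleep() can.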