On Fri Jan 31, 2025 at 10:35 AM EET, Michal Suchánek wrote:
> Hello,
>
> On Fri, Jan 31, 2025 at 01:31:01AM +0200, Jarkko Sakkinen wrote:
> > On Wed Jan 29, 2025 at 6:02 PM EET, Jonathan McDowell wrote:
> > > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > > there is a problem report that booting a specific type of system about
> > > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > > to unlock because of TPM operation timeout.
> > > >
> > > > Minimizing the test case failed so far.
> > > >
> > > > For example, booting into text mode as opposed to graphical desktop
> > > > makes the problem unreproducible.
> > > >
> > > > The test is done with a frankenkernel that has TPM drivers about on par
> > > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > > reproducible, either.
> > > >
> > > > However, given the problem takes up to a day to reproduce I do not have
> > > > much confidence in the negative results.
> > >
> > > So. We see what look like similar timeouts in our fleet, but I haven't
> > > managed to produce a reliable test case that gives me any confidence
> > > about what the cause is.
> > >
> > > https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@xxxxxxxx/
> > >
> > > for my previous post about this.
> >
> > Ugh, this was my first week at new job, sorry.
> >
> > 2000 ms is like a spec value, which can be a bad idea. Please look at
> > Table 18.
> >
> > My guess is that GUI makes more stuff happening in the system, which
> > could make latencies more shaky.
> >
> > The most trivial candidate would be:
> >
> > status = tpm_tis_status(chip);
> > if ((status & TPM_STS_COMMAND_READY) == 0) {
> > 	tpm_tis_ready(chip);
> > 	if (wait_for_tpm_stat
> > 		(chip, TPM_STS_COMMAND_READY, TPM_TIS_TIMEOUT_MAX /* e.g. 2250 ms */,
>
> 2250 is more than the measured 2226 but I have no idea if that's random
> or in some way deterministic.

Your text vs GUI at least gives evidence of stochasticity while not a
full-fledged proof. You can expect e.g. more IRQs happening when you run
a GUI.

I did not engineer that number. You could e.g. double the original
number. The whole framework for timeout_b is ridiculous (if it is
because of me it does not change that fact).

Or perhaps we could consider even wait_event_interruptible() inside
wait_for_tpm_stat(), since it is interruptible.

BR, Jarkko
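
To make the arithmetic concrete, here is a small userspace sketch. It is
not driver code and does not model wait_for_tpm_stat(); the names
fake_chip and fake_wait_for_ready() are hypothetical and exist only for
this illustration. Time is simulated, so nothing sleeps. It only shows
how the 2000 ms value, the 2250 ms example and a doubled 4000 ms value
compare against the measured 2226 ms.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a chip that reports ready after a fixed delay. */
struct fake_chip {
	unsigned int ready_after_ms;
};

/*
 * Poll the fake chip every poll_ms until it reports ready or timeout_ms
 * has elapsed. The clock is a plain counter, not real time.
 * Returns true if the chip became ready within the timeout.
 */
bool fake_wait_for_ready(const struct fake_chip *chip,
			 unsigned int timeout_ms,
			 unsigned int poll_ms)
{
	unsigned int now_ms;

	assert(poll_ms > 0);

	for (now_ms = 0; now_ms <= timeout_ms; now_ms += poll_ms) {
		if (now_ms >= chip->ready_after_ms)
			return true;
	}

	return false;
}
```

With ready_after_ms = 2226 and a 1 ms poll, a 2000 ms timeout fails while
2250 ms and 4000 ms both succeed. The sketch says nothing about whether
2226 is close to the worst case, which is the open question.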