Re: TPM operation times out (very rarely)

Michal Suchánek <msuchanek@xxxxxxx> · Fri, 31 Jan 2025 09:35:45 +0100

Hello,

On Fri, Jan 31, 2025 at 01:31:01AM +0200, Jarkko Sakkinen wrote:
> On Wed Jan 29, 2025 at 6:02 PM EET, Jonathan McDowell wrote:
> > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > there is a problem report that booting a specific type of system about
> > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > to unlock because of TPM operation timeout.
> > > 
> > > Minimizing the test case failed so far.
> > > 
> > > For example, booting into text mode as opposed to graphical desktop
> > > makes the problem unreproducible.
> > > 
> > > The test is done with a frankenkernel that has TPM drivers about on par
> > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > reproducible, either.
> > > 
> > > However, given the problem takes up to a day to reproduce I do not have
> > > much confidence in the negative results.
> >
> > So. We see what look like similar timeouts in our fleet, but I haven't
> > managed to produce a reliable test case that gives me any confidence
> > about what the cause is.
> >
> > https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@xxxxxxxx/
> >
> > for my previous post about this.
> 
> Ugh, this was my first week at new job, sorry.
> 
> 2000 ms is like a spec value, which can be a bad idea. Please look at
> Table 18.
> 
> My guess is that GUI makes more stuff happening in the system, which
> could make latencies more shaky.
> 
> The most trivial candidate would be:
> 
> 	status = tpm_tis_status(chip);
> 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> 		tpm_tis_ready(chip);
> 		if (wait_for_tpm_stat
> 		    (chip, TPM_STS_COMMAND_READY, TPM_TIS_TIMEOUT_MAX /* e.g. 2250 ms */,

2250 ms is more than the measured 2226 ms, but I have no idea whether
that measurement is random or in some way deterministic.

> 		     &priv->int_queue, false) < 0) {
> 		     	rc = -ETIME;
> 			goto out_err;
> 		}
> 	}
> 
> On the other hand, for me tpm_tis_send_main() looked initially weird:
> 
> 	for (try = 0; try < TPM_RETRY; try++) {
> 		rc = tpm_tis_send_data(chip, buf, len);
> 		if (rc >= 0)
> 			/* Data transfer done successfully */
> 			break;
> 		else if (rc != -EIO)
> 			/* Data transfer failed, not recoverable */
> 			return rc;
> 	}
> 
> I.e. no retry on -ETIME.
> 
> But I'd fixup instead tpm_common_write():
> 
> out:
> 	mutex_unlock(&priv->buffer_mutex);
> 
> 	if (ret == -ETIME)
> 		return -ERESTARTSYS;
> 
> 	return ret;
> }
> 
> It still can be interrupted by a signal this way. Retry loop would
> block too much.

Not sure if this would help. As was noted in the discussion so far, if
the value is consumed by the kernel, the upper layer code will likely
not retry.

Also, restarting the userspace cryptsetup service reportedly did not help
address the problem, which suggests that the consumer is indeed the
kernel, and that it marked something as defunct and gave up on getting
the key from the TPM entirely.
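
To illustrate the concern, here is a minimal userspace sketch (not kernel
code). It assumes a hypothetical in-kernel consumer that latches a
"defunct" flag on the first failed TPM operation. The names
consumer_get_key(), fake_tpm_op() and struct consumer are made up for the
example, and I have not identified the real consumer. The point is only
that for such a caller -ERESTARTSYS is just another negative return value,
so nothing gets restarted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512	/* kernel-internal errno, not in userspace <errno.h> */
#endif

/* Hypothetical in-kernel consumer state (assumption, for illustration). */
struct consumer {
	bool defunct;	/* latched after the first failure */
	int tpm_calls;	/* how many times the TPM was actually asked */
};

/* Stand-in for a TPM operation; returns whatever the test sets. */
static int tpm_op_result;

static int fake_tpm_op(void)
{
	return tpm_op_result;
}

static int consumer_get_key(struct consumer *c)
{
	int rc;

	assert(c);

	/* Once marked defunct, the TPM is never asked again. */
	if (c->defunct)
		return -ENODEV;

	c->tpm_calls++;
	rc = fake_tpm_op();
	if (rc < 0) {
		/* Any error, -ERESTARTSYS included, is treated as fatal. */
		c->defunct = true;
		return rc;
	}

	return 0;
}
```

With a consumer like this, a single timed-out operation is enough: later
calls fail even after the TPM has recovered, and restarting a userspace
service changes nothing.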

Also, the loop can already block for up to 2 s. If blocking in the loop
is a problem, then it should be addressed in that loop.
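
As a rough sketch of what addressing it in the loop could look like, the
following userspace simulation retries on -ETIME as well as -EIO, but
stops once a time budget is used up. This is not a tested patch:
fake_send_data() is a hypothetical stand-in for tpm_tis_send_data(), the
TPM_RETRY value and the 2000 ms per-attempt timeout are assumptions for
the example, and the time is only accounted for, not actually slept.

```c
#include <assert.h>
#include <errno.h>

#define TPM_RETRY		50	/* assumed value for this sketch */
#define ATTEMPT_TIMEOUT_MS	2000	/* assumed worst-case wait per attempt */

/* Hypothetical stand-in for tpm_tis_send_data(): times out fail_count times. */
static int fail_count;

static int fake_send_data(void)
{
	if (fail_count > 0) {
		fail_count--;
		return -ETIME;
	}
	return 0;
}

/* Retry on -EIO and -ETIME, bounded by both TPM_RETRY and budget_ms. */
static int send_with_budget(int budget_ms, int *attempts)
{
	int spent_ms = 0;
	int rc = -ETIME;
	int try;

	assert(budget_ms > 0);

	*attempts = 0;
	for (try = 0; try < TPM_RETRY; try++) {
		rc = fake_send_data();
		(*attempts)++;
		if (rc >= 0)
			/* Data transfer done successfully */
			break;
		if (rc != -EIO && rc != -ETIME)
			/* Data transfer failed, not recoverable */
			return rc;
		if (rc == -ETIME)
			/* A timed-out attempt cost the full per-attempt wait. */
			spent_ms += ATTEMPT_TIMEOUT_MS;
		if (spent_ms >= budget_ms)
			/* Budget exhausted, report the last error. */
			break;
	}

	return rc;
}
```

With a 4000 ms budget this gives up after two timed-out attempts, so the
total blocking time stays bounded while a single spurious timeout is still
retried.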

Thanks

Michal

> 
> Not sure if only the increase in timeout value would be enough or
> should the both sites be fixed up.
> 
> [1] https://trustedcomputinggroup.org/wp-content/uploads/PC-Client-Specific-Platform-TPM-Profile-for-TPM-2p0-v1p05p_r14_pub.pdf
> 
> BR, Jarkko