Re: [PATCH 1/2] tpm, tpm_tis: Handle interrupt storm

"Jarkko Sakkinen" <jarkko@xxxxxxxxxx> · Tue, 23 May 2023 22:12:49 +0300

On Tue May 23, 2023 at 9:53 PM EEST, Jarkko Sakkinen wrote:
> On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote:
> > From: Lino Sanfilippo <l.sanfilippo@xxxxxxxxxx>
> >
> > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled
> > interrupts instead of polling on all capable TPMs. Unfortunately, on some
> > products the interrupt line is either never asserted or never deasserted.
> >
> > The former causes interrupt timeouts and is detected by
> > tpm_tis_core_init(). The latter results in interrupt storms.
> >
> > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad
> > L490 and Inspur NF5180M6:
> >
> > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@xxxxxxxxxx/
> > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/
> >
> > The current approach to avoid those storms is to disable interrupts by
> > adding a DMI quirk for the concerned device.
> >
> > However this is a maintenance burden in the long run, so use a generic
> > approach:
>
> I'm trying to comprehend how you evaluate, how big maintenance burden
> this would be. Adding even a few dozen table entries is not a
> maintenance burden.
>
> On the other hand any new functionality is objectively a maintanance
> burden of some measure (applies to any functionality). So how do we know
> that taking this change is less of a maintenance burden than just add
> new table entries, as they come up?
>
> > Detect an interrupt storm by counting the number of unhandled interrupts
> > within a 10 ms time interval. In case that more than 1000 were unhandled
> > deactivate interrupts, deregister the handler and fall back to polling.
>
> I know it can be sometimes hard to evaluate but can you try to explain
> how you came up to the 10 ms sampling period and 1000 interrupt
> threshold? I just don't like abritrary numbers.

Also here I wonder how you came up with this computational model. This
is not same as saying it is wrong. There's just whole stack of options.

Out of top of my head you could e.g. window average the duration between
IRQs. When the average goes beyond threshold, then you shutdown
interrupts.

The pro I would see in this that it is much easier intuitively discuss
how much there should be time in-between interrupts that the kernel
handles it, than how many IRQs you can stack into time interval, which
blows my head tbh.

BR, Jarkko