Re: [PATCH 1/2] tpm, tpm_tis: Handle interrupt storm

Péter Ujfalusi <peter.ujfalusi@xxxxxxxxxxxxxxx> · Mon, 29 May 2023 09:46:08 +0300

Hi Lino,

On 23/05/2023 23:46, Lino Sanfilippo wrote:
>> On the other hand any new functionality is objectively a maintanance
>> burden of some measure (applies to any functionality). So how do we know
>> that taking this change is less of a maintenance burden than just add
>> new table entries, as they come up?
>>
> 
> Initially this set was created as a response to this 0-day bug report which you asked me
> to have a look at:
> 
> https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/
> 
> My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not
> (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus
> needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once
> 6.4 is released. 

I'm surprised that there is a need for a storm detection in the first
place... Do we have something else on the same IRQ line on the affected
devices which might have a bug or no driver at all?
It is hard to believe that a TPM (Trusted Platform Module) is integrated
so poorly ;)

But put that aside: I think the storm detection is good given that there
is no other way to know which machine have sloppy TPM integration.
There are machines where this happens, so it is a know integration
issue, right?

My only 'nitpick' is with the printk level to be used.
The ERR level is not correct as we know the issue and we handle it, so
all is under control.
If we want to add these machines to the quirk list then WARN is a good
level to gain attention but I'm not sure if a user will know how to get
the machine in the quirk (where to file a bug).
If we only want the quirk to be used for machines like UPX-i11 which
simply just have broken (likely floating) IRQ line then the WARN is too
high level, INFO or even DBG would be appropriate as you are not going
to update the quirk, it is just handled under the hood (which is a great
thing, but on the other hand you will have the storm never the less and
that is not a nice thing).

It is a matter on how this is going to be handled in a long term. Add
quirk for all the known machines with either stormy or plain broken IRQ
line or handle the stormy ones and quirk the broken ones only.

>>> Detect an interrupt storm by counting the number of unhandled interrupts
>>> within a 10 ms time interval. In case that more than 1000 were unhandled
>>> deactivate interrupts, deregister the handler and fall back to polling.
>>
>> I know it can be sometimes hard to evaluate but can you try to explain
>> how you came up to the 10 ms sampling period and 1000 interrupt
>> threshold? I just don't like abritrary numbers.
> 
> At least the 100 ms is not plucked out of thin air but its the same time period
> that the generic code in note_interrupt() uses - I assume for a good reason.
> Not only this number but the whole irq storm detection logic is taken from 
> there: 
> 
>>
>>> This equals the implementation that handles interrupt storms in
>>> note_interrupt() by means of timestamps and counters in struct irq_desc.
>> The number of 1000 unhandled interrupts is still far below the 99900
used in
> note_interrupt() but IMHO enough to indicate that there is something seriously
> wrong with interrupt processing and it is probably saver to fall back to polling.

Except that if the line got the spurious designation in core, the
interrupt line will be disabled while the TPM driver will think that it
is still using IRQ mode and will not switch to polling.

A storm of 1000 is better than a storm of 99900 for sure but quirking
these would be the desired final solution. imho

There are many buts around this ;)

-- 
Péter