Hi,

On 23.05.23 20:53, Jarkko Sakkinen wrote:
> On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote:
>> From: Lino Sanfilippo <l.sanfilippo@xxxxxxxxxx>
>>
>> Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled
>> interrupts instead of polling on all capable TPMs. Unfortunately, on some
>> products the interrupt line is either never asserted or never deasserted.
>>
>> The former causes interrupt timeouts and is detected by
>> tpm_tis_core_init(). The latter results in interrupt storms.
>>
>> Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad
>> L490 and Inspur NF5180M6:
>>
>> https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@xxxxxxxxxx/
>> https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/
>>
>> The current approach to avoid those storms is to disable interrupts by
>> adding a DMI quirk for the concerned device.
>>
>> However this is a maintenance burden in the long run, so use a generic
>> approach:
>
> I'm trying to comprehend how you evaluate, how big maintenance burden
> this would be. Adding even a few dozen table entries is not a
> maintenance burden.
>
> On the other hand any new functionality is objectively a maintenance
> burden of some measure (applies to any functionality). So how do we know
> that taking this change is less of a maintenance burden than just add
> new table entries, as they come up?
>

Initially this set was created as a response to this 0-day bug report,
which you asked me to have a look at:

https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/

My hope was that it could also make some (existing or future) DMI
entries unnecessary. But even if it does not (e.g. the problem Péter
Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and
thus needs the DMI quirk), we may at least avoid more bug reports due to
interrupt storms once 6.4 is released.

>> Detect an interrupt storm by counting the number of unhandled interrupts
>> within a 10 ms time interval. In case that more than 1000 were unhandled
>> deactivate interrupts, deregister the handler and fall back to polling.
>
> I know it can be sometimes hard to evaluate but can you try to explain
> how you came up to the 10 ms sampling period and 1000 interrupt
> threshold? I just don't like arbitrary numbers.

At least the 100 ms is not plucked out of thin air: it is the same time
period that the generic code in note_interrupt() uses, I assume for a
good reason. Not only this number but the whole irq storm detection
logic is taken from there:

>
>> This equals the implementation that handles interrupt storms in
>> note_interrupt() by means of timestamps and counters in struct irq_desc.

The threshold of 1000 unhandled interrupts is still far below the 99900
used in note_interrupt(), but IMHO it is enough to indicate that
something is seriously wrong with interrupt processing and that it is
probably safer to fall back to polling.

Regards,
Lino
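
P.S. To make the counting scheme discussed above concrete, here is a
minimal user-space sketch. It is not the code from the patch: the names
storm_state and storm_note_unhandled() are made up for illustration, and
time is passed in as plain milliseconds instead of jiffies. The window
is left as a parameter; the threshold would be 1000 as in the commit
message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-device bookkeeping, loosely modelled on the
 * timestamp and counter kept for note_interrupt() in struct irq_desc.
 */
struct storm_state {
	uint64_t window_ms;          /* max gap between unhandled irqs in a burst */
	unsigned int threshold;      /* e.g. 1000, as in the commit message */
	uint64_t last_unhandled_ms;  /* time of the previous unhandled irq */
	unsigned int unhandled;      /* unhandled irqs in the current burst */
};

/*
 * Call for every interrupt that the handler could not attribute to the
 * device. Returns true once more than 'threshold' unhandled interrupts
 * have arrived without a gap longer than 'window_ms' between them; the
 * caller would then disable interrupts and fall back to polling.
 */
static inline bool storm_note_unhandled(struct storm_state *s, uint64_t now_ms)
{
	if (now_ms - s->last_unhandled_ms > s->window_ms)
		s->unhandled = 1;    /* previous burst is over, start a new one */
	else
		s->unhandled++;

	s->last_unhandled_ms = now_ms;

	return s->unhandled > s->threshold;
}
```

A quiet period longer than the window resets the counter, so occasional
spurious interrupts never add up to a storm; only a sustained burst
crosses the threshold.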