On Tue May 30, 2023 at 8:56 PM EEST, Jerry Snitselaar wrote: > On Mon, May 29, 2023 at 09:46:08AM +0300, Péter Ujfalusi wrote: > > Hi Lino, > > > > On 23/05/2023 23:46, Lino Sanfilippo wrote: > > >> On the other hand any new functionality is objectively a maintanance > > >> burden of some measure (applies to any functionality). So how do we know > > >> that taking this change is less of a maintenance burden than just add > > >> new table entries, as they come up? > > >> > > > > > > Initially this set was created as a response to this 0-day bug report which you asked me > > > to have a look at: > > > > > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/ > > > > > > My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not > > > (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus > > > needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once > > > 6.4 is released. > > > > I'm surprised that there is a need for a storm detection in the first > > place... Do we have something else on the same IRQ line on the affected > > devices which might have a bug or no driver at all? > > It is hard to believe that a TPM (Trusted Platform Module) is integrated > > so poorly ;) > > > > But put that aside: I think the storm detection is good given that there > > is no other way to know which machine have sloppy TPM integration. > > There are machines where this happens, so it is a know integration > > issue, right? > > > > My only 'nitpick' is with the printk level to be used. > > The ERR level is not correct as we know the issue and we handle it, so > > all is under control. > > If we want to add these machines to the quirk list then WARN is a good > > level to gain attention but I'm not sure if a user will know how to get > > the machine in the quirk (where to file a bug). > > If we only want the quirk to be used for machines like UPX-i11 which > > simply just have broken (likely floating) IRQ line then the WARN is too > > high level, INFO or even DBG would be appropriate as you are not going > > to update the quirk, it is just handled under the hood (which is a great > > thing, but on the other hand you will have the storm never the less and > > that is not a nice thing). > > > > It is a matter on how this is going to be handled in a long term. Add > > quirk for all the known machines with either stormy or plain broken IRQ > > line or handle the stormy ones and quirk the broken ones only. > > > > >>> Detect an interrupt storm by counting the number of unhandled interrupts > > >>> within a 10 ms time interval. In case that more than 1000 were unhandled > > >>> deactivate interrupts, deregister the handler and fall back to polling. > > >> > > >> I know it can be sometimes hard to evaluate but can you try to explain > > >> how you came up to the 10 ms sampling period and 1000 interrupt > > >> threshold? I just don't like abritrary numbers. > > > > > > At least the 100 ms is not plucked out of thin air but its the same time period > > > that the generic code in note_interrupt() uses - I assume for a good reason. > > > Not only this number but the whole irq storm detection logic is taken from > > > there: > > > > > >> > > >>> This equals the implementation that handles interrupt storms in > > >>> note_interrupt() by means of timestamps and counters in struct irq_desc. > > >> The number of 1000 unhandled interrupts is still far below the 99900 > > used in > > > note_interrupt() but IMHO enough to indicate that there is something seriously > > > wrong with interrupt processing and it is probably saver to fall back to polling. > > > > Except that if the line got the spurious designation in core, the > > interrupt line will be disabled while the TPM driver will think that it > > is still using IRQ mode and will not switch to polling. > > > > A storm of 1000 is better than a storm of 99900 for sure but quirking > > these would be the desired final solution. imho > > If that is the case, then output could probably be sent to the console > detailing the dmi info needed to update the table. +1 Good idea. BR, Jarkko