Andy Walls <awalls@xxxxxxxxx> writes: > On Sun, 2010-03-21 at 21:08 +0100, Bjørn Mork wrote: >> As discussed many times, e.g. in http://lkml.org/lkml/2007/7/26/401 >> mixing IRQF_DISABLED with IRQF_SHARED just doesn't make sense. >> >> Remove IRQF_DISABLED to avoid random unexpected behaviour. >> >> Ever since I started using the saa7146 driver, I've had occasional >> soft lockups. I do not have any real evidence that the saa7146 >> driver is the cause, but the lockups are gone after removing the >> IRQF_DISABLED flag from this driver. >> >> On my system, this driver shares an irq17 with the pata_jmicron >> driver: >> >> 17: 2115 10605 9422844 8193902 IO-APIC-fasteoi pata_jmicron, saa7146 (0) >> >> This may be a mitigating factor. >> >> Signed-off-by: Bjørn Mork <bjorn@xxxxxxx> >> Cc: stable@xxxxxxxxxx > > And here are some more recent discussions: > > http://lkml.org/lkml/2009/11/30/215 > http://lkml.org/lkml/2009/3/2/33 > http://lkml.org/lkml/2009/3/2/225 > http://www.mail-archive.com/ivtv-devel@xxxxxxxxxxxxxx/msg06319.html > http://www.mail-archive.com/ivtv-devel@xxxxxxxxxxxxxx/msg06362.html > > And the ones on the LKML seem prettry inconclusive to me. OK, I don't really feel up to arguing against any of those. But the argument seems to be more along the lines of whether the requirement should be always disabled or always enabled. Most people seem to agree that it should be one or the other, and *not* a mix, and hence that the IRQF_DISABLED should go away (but possibly be replaced by disabled as default behaviour). The discussion about which is correct, always disabled or always enabled, is way out of my league. But I believe that current drivers have to adapt to the current kernel default, and that is always enabled. The patch in http://lkml.org/lkml/2009/3/2/33 might be the correct solution eventually, but attempting to implement this in a handful drivers is not going to work. By using IRQF_DISABLED you are only triggering the bugs which makes Linus hesitate to take that patch, in a very random and rather undebuggable way. That's not good. To quote one of Linus' followups to http://lkml.org/lkml/2009/3/2/33 (in http://lkml.org/lkml/2009/3/2/186): "The whole - and only - point is that there are drivers that are _known_ to require non-IRQF_DISABLED semantics. IDE is one such one." *This* is what makes using IRQF_SHARED || IRQF_DISABLED risky, IMHO. You can't currently guarantee that you don't share the line with one of those drivers. > If the saa7146 driver was registered second, then this change should > have no effect on your system. > > If the saa7146 driver was registered first, then this can cause the > saa7146 driver's interrupt handler to be interrupted. I doubt the > saa7146 driver is prepared for this contingency. > > I doubt that this is the "proper" fix for your problem. No, maybe it's not. But doesn't the fact that you can't predict the actual effect of the IRQF_DISABLED flag tell you that using it is wrong? Removing it will at least provide a predictable outcome. The problem you miss above is how the other drivers sharing this interrupt will deal with the, to them, unexpected occasional disabled interrupts. How the heck do you ensure that they can handle it? I assume the real fix would be to ensure that the saa7146 interrupt handler can be interrupted. I have no idea how to to that. Do you know why it can't be interrupted? It doesn't look particularily uninterruptable to me. And as you say: It will be interrupted even with the IRQF_DISABLED flag. The alternative to making ensure that the saa7146 interrupt handler can be interrupted is really not keeping IRQF_DISABLED, but - making sure that *every* interrupt handler in the kernel can run with interrupts disabled, and - change the default to running with interrupts disabled I believe it's easier to modify the saa7146 driver... > Does the "soft lockup" put an Oops or BUG message in dmesg > or /var/log/messages? > > What precisely do you mean by "soft lockup"? I mean CPU cores getting stuck, like: Mar 20 01:02:56 canardo kernel: [96555.159999] BUG: soft lockup - CPU#0 stuck for 61s! [lmcoretemp:7424] Mar 20 01:02:56 canardo kernel: [96603.480497] BUG: soft lockup - CPU#1 stuck for 61s! [kswapd0:47] Mar 20 01:02:56 canardo kernel: [96603.480999] BUG: soft lockup - CPU#2 stuck for 61s! [rndc:9119] Mar 20 01:02:56 canardo kernel: [96620.659996] BUG: soft lockup - CPU#0 stuck for 61s! [lmcoretemp:7424] Mar 20 01:02:56 canardo kernel: [96668.976496] BUG: soft lockup - CPU#1 stuck for 61s! [kswapd0:47] Mar 20 01:02:56 canardo kernel: [96668.976995] BUG: soft lockup - CPU#2 stuck for 61s! [rndc:9119] Mar 20 01:02:56 canardo kernel: [96686.159997] BUG: soft lockup - CPU#0 stuck for 61s! [lmcoretemp:7424] As the syslogging is local, you can't really trust syslog's timestamp. But the kernel timestamps should be accurate. The incident above started with Mar 20 01:02:56 canardo kernel: [96513.561007] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.577006] DVB: TDA10023(0): tda10023_writereg, writereg error (reg == 0x0d, val == 0x86, ret == -5) Mar 20 01:02:56 canardo kernel: [96513.589006] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.617006] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.645006] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.673005] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.689006] DVB: TDA10023(0): tda10023_writereg, writereg error (reg == 0x3d, val == 0x00, ret == -5) Mar 20 01:02:56 canardo kernel: [96513.701006] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.729006] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.757006] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.785006] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.801005] DVB: TDA10023(0): tda10023_writereg, writereg error (reg == 0x05, val == 0x26, ret == -5) Mar 20 01:02:56 canardo kernel: [96513.813005] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.841007] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.869005] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer Mar 20 01:02:56 canardo kernel: [96513.897007] saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer and a few thousand more of those before: Mar 20 01:02:56 canardo kernel: [96712.636963] Clocksource tsc unstable (delta = 1700735019709 ns) Which I believe is pretty weird... Mar 20 01:02:58 canardo kernel: [96714.968443] Switching to clocksource hpet OK, consequence of the above. Mar 20 01:02:58 canardo kernel: [96714.973077] saa7146: unregister extension 'budget_av'. Mar 20 01:02:58 canardo kernel: [96714.973536] budget_av 0000:05:01.0: PCI INT A disabled And this seems to be what "solved" the problem. I don't know why budget_av was unloaded here. It was not manual intervention. Does it unload itself due to the extensive timeouts? Anyway, disabling the IRQ allowed the machine to continue. Unfortunately with rather ugly side effects like libata giving up on a few of the disks and disabling them, but still better than just crashing... I assume you can argue that this is caused by a bug in some other driver's interrupt handler. And you are of course right. But that bug is in practice allowed today, due to the fact that the default is running with interrupts enabled. Allowing saa7146 to run with interrupts enabled hides that bug for me. Bjørn -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html