Jeff Garzik wrote:
Ric Wheeler wrote:
Jeff Garzik wrote:
TESTING:
* Although most drivers by count received few operational changes, the
common probe path was updated, so all drivers need fresh "yes, it sees
all my disks" regression testing.
* ahci and sata_sil24 were touched a lot, and so need additional
testing.
* sata_sil and ata_piix also need healthy re-testing of all basic
functionality.
I have been running a moderate write workload on this (built using
linux-2.6.17-rc4 with your patch applied on top). Last night, I ran
on a set of clean AHCI based boxes (no bad drives) and got a series
of occasional spurious interrupts logged:
May 15 21:24:38 centera kernel: ReiserFS: sdd14: Using r5 hash to sort names
May 15 21:52:44 centera kernel: ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x800)
May 15 22:00:02 centera run-crons[26837]: logrotate returned 1
May 15 22:16:00 centera kernel: ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x4)
May 15 22:29:14 centera kernel: ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x7e007fff)
May 15 22:35:04 centera kernel: ata3: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x4fffffff)
Full messages file and lspci below, but note that this hardware has
been running ahci with this config in production for over a year now.
Definitely new behavior. In each case you have irq_stat == 0x8, which
indicates a Set Device Bits FIS has been received.
Yeap, new behavior. Though, one thing to note is that the original
ahci_host_intr() never bothered to report spurious interrupts. It always
returned 1, telling ahci_interrupt() that the interrupt was handled. But
as this is SDB instead of D2H, my guess is that the drive is sending
spurious NCQ completions with no new command completed.
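To make that guess concrete, here is a minimal sketch of the check in plain C. It is a simplified model, not the driver code: completed_tags() and sdb_is_spurious() are hypothetical helpers, and the two masks stand in for the driver's view of in-flight tags and the controller's active-tag register. Only the bit value (irq_stat 0x8 for a received Set Device Bits FIS) and the PORT_IRQ_SDB_FIS name come from the messages and the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* irq_stat bit 3: a Set Device Bits FIS was received (the 0x8 in the log). */
#define PORT_IRQ_SDB_FIS (1u << 3)

/*
 * Hypothetical helper: tags the driver still tracks as in flight but which
 * the hardware no longer reports as active are the ones that completed.
 */
static uint32_t completed_tags(uint32_t driver_active, uint32_t hw_active)
{
	return driver_active & ~hw_active;
}

/*
 * Hypothetical helper: an SDB FIS interrupt is spurious in this model when
 * it arrives without completing any outstanding tag.
 */
static bool sdb_is_spurious(uint32_t irq_stat, uint32_t driver_active,
			    uint32_t hw_active)
{
	if (!(irq_stat & PORT_IRQ_SDB_FIS))
		return false;
	return completed_tags(driver_active, hw_active) == 0;
}
```

With irq_stat 0x8 and both masks equal to 0x800, as in the first log line, nothing has completed and the interrupt counts as spurious under this model.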
Hmm.. Can you try the attached patch and report what the kernel says?
The message reminds me of several things...
* can we make tags int and use -1 for invalid tag? ATA_TAG_POISON looks
horrible when printed.
* it would be nice to have some framework to determine whether the
controller is receiving too many consecutive spurious interrupts. Say,
32 irqs in a row without intervening valid interrupts is a good reason
to be suspicious about a stuck IRQ. Freezing & resetting will resolve
the situation in most cases.
* With NCQ, some drives generate spurious D2H FISes with I bit set as if
it were executing non-NCQ commands. So, regardless of controller, we're
likely to see similar problems (but sil24 does all the protocol handling
and ignores such FISes by itself). This can be combined with the above
freeze on too many spurious interrupts, I guess.
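The second idea could look roughly like the sketch below. It is a standalone illustration, not libata code: struct spurious_tracker, spurious_track() and SPURIOUS_LIMIT are hypothetical names, and the limit of 32 is just the number floated above. The caller would freeze and reset the port whenever the function returns true.

```c
#include <stdbool.h>

/* Assumed threshold: 32 spurious irqs in a row, as suggested above. */
#define SPURIOUS_LIMIT 32

/* Hypothetical per-port state: length of the current spurious run. */
struct spurious_tracker {
	unsigned int consecutive;
};

/*
 * Record one interrupt. 'handled' is true if the interrupt did real work.
 * Returns true when SPURIOUS_LIMIT spurious interrupts have arrived back
 * to back, i.e. when the caller should freeze and reset the port.
 */
static bool spurious_track(struct spurious_tracker *t, bool handled)
{
	if (handled) {
		/* A valid interrupt breaks the run. */
		t->consecutive = 0;
		return false;
	}

	if (++t->consecutive < SPURIOUS_LIMIT)
		return false;

	/* Limit reached: start counting afresh after the reset. */
	t->consecutive = 0;
	return true;
}
```

Resetting the counter on every valid interrupt means occasional spurious completions, like the ones in the log above, never trip the limit. Only an uninterrupted run does.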
--
tejun
diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
index 45fd71d..506f0df 100644
--- a/drivers/scsi/ahci.c
+++ b/drivers/scsi/ahci.c
@@ -916,10 +916,19 @@ static void ahci_host_intr(struct ata_po
 		return;
 	}
 
-	if (ata_ratelimit())
+	if (ata_ratelimit()) {
 		ata_port_printk(ap, KERN_INFO, "spurious interrupt "
 				"(irq_stat 0x%x active_tag %d sactive 0x%x)\n",
 				status, ap->active_tag, ap->sactive);
+		if (status & PORT_IRQ_SDB_FIS) {
+			struct ahci_port_priv *pp = ap->private_data;
+			u32 *sdb_fis = pp->rx_fis + 0x58;
+
+			ata_port_printk(ap, KERN_INFO, "spurious SDB FIS "
+				"%08x:%08x ap->qc_active=%08x qc_active=%08x\n",
+				sdb_fis[0], sdb_fis[1], ap->qc_active, qc_active);
+		}
+	}
 }
 
 static void ahci_irq_clear(struct ata_port *ap)