Re: [RFT] major libata update

Jeff Garzik wrote:
Ric Wheeler wrote:

Jeff Garzik wrote:

TESTING:
* Although most drivers by count received few operational changes, the
common probe path was updated, so all drivers need fresh "yes, it sees
all my disks" regression testing.

* ahci and sata_sil24 were touched a lot, and so need additional
testing.

* sata_sil and ata_piix also need healthy re-testing of all basic
functionality.


I have been running a moderate write workload on this (built using linux-2.6.17-rc4 with your patch applied on top). Last night, I ran on a set of clean AHCI-based boxes (no bad drives) and got a series of occasional spurious interrupts logged:


May 15 21:24:38 centera kernel: ReiserFS: sdd14: Using r5 hash to sort names
May 15 21:52:44 centera kernel: ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x800)
May 15 22:00:02 centera run-crons[26837]: logrotate returned 1
May 15 22:16:00 centera kernel: ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x4)
May 15 22:29:14 centera kernel: ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x7e007fff)
May 15 22:35:04 centera kernel: ata3: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x4fffffff)

Full messages file and lspci below, but note that this hardware has been running ahci with this config in production for over a year now.

Definitely new behavior. In each case you have irq_stat == 0x8, which indicates a Set Device Bits FIS has been received.
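
For readers decoding the log lines above, here is a minimal sketch of how the two fields can be interpreted. The PORT_IRQ_* bit positions follow the AHCI port interrupt status layout used in drivers/scsi/ahci.c; the ATA_TAG_POISON value (0xfafbfcfd) is assumed from the libata headers of that era, and irq_stat_fis_name() is a hypothetical helper written only for this illustration.

```c
#include <stdint.h>

/* Low bits of the AHCI per-port interrupt status register. */
enum {
	PORT_IRQ_D2H_REG_FIS	= (1 << 0),	/* D2H Register FIS received */
	PORT_IRQ_PIO_SETUP_FIS	= (1 << 1),	/* PIO Setup FIS received */
	PORT_IRQ_DMA_SETUP_FIS	= (1 << 2),	/* DMA Setup FIS received */
	PORT_IRQ_SDB_FIS	= (1 << 3),	/* Set Device Bits FIS received */
};

/* Assumed value of libata's poison tag; not taken from this thread. */
#define ATA_TAG_POISON	0xfafbfcfdU

/* Hypothetical helper: name the lowest FIS-received bit set in irq_stat. */
static const char *irq_stat_fis_name(uint32_t irq_stat)
{
	if (irq_stat & PORT_IRQ_D2H_REG_FIS)
		return "D2H Register FIS";
	if (irq_stat & PORT_IRQ_PIO_SETUP_FIS)
		return "PIO Setup FIS";
	if (irq_stat & PORT_IRQ_DMA_SETUP_FIS)
		return "DMA Setup FIS";
	if (irq_stat & PORT_IRQ_SDB_FIS)
		return "Set Device Bits FIS";
	return "no FIS-received bit set";
}

/* The value a "%d" format shows for an unsigned 32-bit tag. */
static int32_t tag_as_printed(uint32_t tag)
{
	return (int32_t)tag;
}
```

With irq_stat 0x8 only bit 3 is set, so the helper names the Set Device Bits FIS. Under the assumed poison value, tag_as_printed(ATA_TAG_POISON) gives -84148995, which matches the active_tag in every logged line: no non-NCQ command was active on the port.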


Yeap, new behavior. Though, one thing to note is that the original ahci_host_intr() never bothered to report spurious interrupts. It always returned 1, telling ahci_interrupt() that the interrupt was handled. But as this is SDB instead of D2H, my guess is that the drive is sending spurious NCQ completions with no new command completed.

Hmm.. Can you try the attached patch and report what the kernel says?

The message reminds me of several things...

* can we make tags int and use -1 for invalid tag? ATA_TAG_POISON looks horrible when printed.

* it would be nice to have some framework to determine whether the controller is receiving too many consecutive spurious interrupts. Say, 32 irqs in a row without intervening valid interrupts is a good reason to suspect a stuck IRQ. Freezing & resetting will resolve the situation in most cases.

* With NCQ, some drives generate spurious D2H FISes with the I bit set, as if they were executing non-NCQ commands. So, regardless of controller, we're likely to see similar problems (but sil24 does all the protocol handling and ignores such FISes by itself). This can be combined with the above freeze on too many spurious interrupts, I guess.
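
The "too many consecutive spurious interrupts" idea from the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not libata code: struct spurious_tracker, spurious_tracker_update() and SPURIOUS_IRQ_LIMIT are hypothetical names, and the threshold of 32 is simply the figure suggested above.

```c
#include <stdbool.h>

/* Hypothetical threshold: the "32 irqs in a row" suggested above. */
#define SPURIOUS_IRQ_LIMIT	32

/* Hypothetical per-port state; a real driver would keep this in port data. */
struct spurious_tracker {
	unsigned int consecutive;	/* spurious irqs since last valid one */
};

/*
 * Record the outcome of one interrupt.  Returns true when the port has
 * seen SPURIOUS_IRQ_LIMIT spurious interrupts in a row, at which point
 * the caller would freeze and reset the port.
 */
static bool spurious_tracker_update(struct spurious_tracker *t,
				    bool irq_was_valid)
{
	if (irq_was_valid) {
		/* A valid interrupt breaks the run. */
		t->consecutive = 0;
		return false;
	}

	if (++t->consecutive < SPURIOUS_IRQ_LIMIT)
		return false;

	/* Start counting afresh once the caller has been told to reset. */
	t->consecutive = 0;
	return true;
}
```

An interrupt handler would call spurious_tracker_update() once per interrupt, passing whether any command completion or error was actually processed, and would trigger its freeze-and-reset path only when the function returns true.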

--
tejun
diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
index 45fd71d..506f0df 100644
--- a/drivers/scsi/ahci.c
+++ b/drivers/scsi/ahci.c
@@ -916,10 +916,19 @@ static void ahci_host_intr(struct ata_po
 			return;
 	}
 
-	if (ata_ratelimit())
+	if (ata_ratelimit()) {
 		ata_port_printk(ap, KERN_INFO, "spurious interrupt "
 				"(irq_stat 0x%x active_tag %d sactive 0x%x)\n",
 				status, ap->active_tag, ap->sactive);
+		if (status & PORT_IRQ_SDB_FIS) {
+			struct ahci_port_priv *pp = ap->private_data;
+			u32 *sdb_fis = pp->rx_fis + 0x58;
+
+			ata_port_printk(ap, KERN_INFO, "spurious SDB FIS "
+				"%08x:%08x ap->qc_active=%08x qc_active=%08x\n",
+				sdb_fis[0], sdb_fis[1], ap->qc_active, qc_active);
+		}
+	}
 }
 
 static void ahci_irq_clear(struct ata_port *ap)
