I've not got too far debugging this, but thought I'd post here in case
anyone had any bright ideas...

I have a single box with multiple drives and controller cards - it's
Intel Sandy Bridge based with a Gigabyte GA-PH67-UD3-B3 motherboard.

The problem has appeared with the Syba/IOCrest "PEX40008" Silicon Image
3124 cards - these are a slightly unusual card with the PCI-X SiI3124
chip sitting behind a PCI Express (single lane PCIe 1.0) to PCI-X
bridge.

01:00.0 PCI bridge: Pericom Semiconductor Device e111 (rev 02)
02:04.0 Mass storage controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)

apparently identical to this one:

http://www.ebay.com/itm/171387103100

Under heavy I/O (e.g. raid rebuild, or multiple 'dd'), after a few
minutes the box locks up, normally with messages like:

sata_sil24 0000:02:04.0: IRQ status == 0xffffffff, PCI fault or device removal?

(ref my recent patch to identify which card gets the IRQ status error
message).

The issue appears with multiple cards bought at different times from
different vendors.

I can't get this lock-up to occur with the 3124 cards on the 3.2 kernel.

If I move all the active disks to Silicon Image 3132 based controllers
instead, I don't see the lock-up on either 3.2 or 3.16-rc6.

I'm currently testing using a single 3124 PCI card behind a different
PCIe to PCI bridge chip (an ITE 8892 on the motherboard) and haven't had
any lockups after a couple of hours, but this controller is limited to
about 75 MB/sec throughput, due to being on a 33MHz PCI bus (limitation
of the ITE 8892 bridge chip), whereas the PEX40008 cards get circa
200 MB/sec throughput (limited by a single PCIe slot running at PCIe 1.0
speeds).
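As a rough sanity check on those two throughput figures, here is a minimal sketch of the theoretical bus ceilings. It assumes the PCI bus behind the ITE 8892 is 32 bits wide (an assumption, not something checked on this box) and that the PCIe 1.0 link uses the standard 2.5 GT/s per lane with 8b/10b encoding. The function names are made up for illustration.

```python
# Theoretical bus ceilings in MB/s (1 MB = 10**6 bytes).
# These are raw link rates; protocol overhead means real throughput is lower.


def pci_peak_mb_s(clock_mhz: float, width_bits: int) -> float:
    """Peak rate of a parallel PCI bus: one width_bits transfer per clock."""
    return clock_mhz * width_bits / 8


def pcie_gen1_peak_mb_s(lanes: int) -> float:
    """Peak rate of a PCIe 1.0 link: 2.5 GT/s per lane, 8b/10b encoded."""
    raw_mbit_s = 2500.0 * lanes       # 2.5 GT/s per lane, in Mbit/s
    return raw_mbit_s * 8 / 10 / 8    # 8b/10b payload bits, then bits to bytes


if __name__ == "__main__":
    # Assumption: 32-bit bus width behind the ITE 8892.
    pci_33 = pci_peak_mb_s(33.33, 32)   # about 133 MB/s
    pcie_x1 = pcie_gen1_peak_mb_s(1)    # 250 MB/s

    # Observed figures quoted above: about 75 MB/s and circa 200 MB/s.
    for name, observed, ceiling in (
        ("3124 behind ITE 8892 (PCI 33MHz)", 75.0, pci_33),
        ("PEX40008 (PCIe 1.0 x1)", 200.0, pcie_x1),
    ):
        print(f"{name}: {observed:.0f} of {ceiling:.0f} MB/s "
              f"({observed / ceiling:.0%} of ceiling)")
```

Both observed numbers sit below the respective ceilings, which is consistent with the bus, rather than the disks, being the limit in each case.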
/proc/interrupts [excerpt below] on this box shows shared IRQs to the
Silicon Image cards:

 16:    4053039          0   IO-APIC-fasteoi   sata_sil24, sata_sil24
 18:         82     573633   IO-APIC-fasteoi   sata_sil24, ehci_hcd:usb1, i801_smbus
 19:        114    5553654   IO-APIC-fasteoi   sata_sil24, sata_sil24
 23:         37          0   IO-APIC-fasteoi   ehci_hcd:usb2
 41:     184317          0   PCI-MSI-edge      eth0
 42:     184364          0   PCI-MSI-edge      ahci
 43:          3    3489756   PCI-MSI-edge      ahci
 44:         12          0   PCI-MSI-edge      mei_me
 45:        422          0   PCI-MSI-edge      snd_hda_intel

... and seeing as the fault seems to be IRQ related, I thought I'd use
the "msi=1" parameter to the sata_sil24 driver to try and enable message
signalled interrupts, but I didn't have any luck with this.

The 3132 cards enumerate drives correctly and receive a few hundred or a
few thousand interrupts (e.g. 2050 on one card, and 2041 on the other in
the test I just ran), and thereafter don't get any further interrupts,
causing subsequent I/O to fail:

~# hdparm -t /dev/sdi

/dev/sdi:
 Timing buffered disk reads: read(2097152) returned 131072 bytes

... whereas the 3124 cards don't receive any interrupts at all (all
counts zero in /proc/interrupts).

MSI is already in use and working fine for the Intel H67 AHCI
controller, and also for an ASMedia ASM1062 based PCIe card in the same
machine.

The same MSI failure occurs on both the 3.2 and 3.16-rc6 kernels on this
box.

Unfortunately, my opportunities to debug on this box are a bit limited
due to pressure to keep it in service, but I have a couple of 3124 and
3132 cards that I can borrow from the machine in question and elsewhere,
and put into another box to test.

Any suggestions for things which would throw more light on the matter
(I know that using git bisect is a possibility)?

Tim.

--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53
http://seoss.co.uk/
+44-(0)1273-808309