On 10/08/2009 12:36 PM, Christian Pernegger wrote:
Hi all!

My new box doesn't seem to like its sata_sil24 controllers under load. The attached log snippet (syslog.gz) is of the first occurrence of the error, some tens of minutes into a checkarray --all (basically does echo check>/sys/block/md*/md/sync_action). End result was a hang where not even Alt-SysRq would do any good.

It isn't md, though, just running badblocks on all disks on one of the sata_sil24s in parallel does the trick as well.

The error messages are not always exactly the same and do not always result in a hang of the whole machine. More recently I've had:

[ 632.710900] sata_sil24: IRQ status == 0xffffffff, PCI fault or device removal?
[ 632.820017] ata13.00: exception Emask 0x20 SAct 0xffff SErr 0x0 action 0x6 frozen
[ 632.820073] ata11.00: exception Emask 0x20 SAct 0xbfff SErr 0x0 action 0x6 frozen
[ 632.820076] ata11.00: irq_stat 0x00020002, PCI master abort while transferring data
Looks like there's something unhappy between the card and the motherboard..
[ 632.820083] ata11.00: cmd 60/00:00:3f:62:45/04:00:00:00:00/40 tag 0 ncq 524288 in
[ 632.820085] res 6c/0b:02:02:00:00/00:00:00:00:6c/00 Emask 0x22 (host bus error)
[ 632.820087] ata11.00: status: { DRDY DF DRQ }
[ 632.820093] ata11.00: cmd 60/00:08:3f:46:45/04:00:00:00:00/40 tag 1 ncq 524288 in
[ 632.820094] res 6c/0b:02:02:00:00/00:00:00:10:6c/00 Emask 0x22 (host bus error)
[ 632.820096] ata11.00: status: { DRDY DF DRQ }
... and so on for the other in-flight tags ...

Hard- and Software:

Tyan Toledo iE3210W (S5211), 6x SATA-300 [ahci]
Intel C2Q 9550s
8GB Crucial DDR2-ECC RAM
2x Dawicontrol DC-4320 RAID in the PCI-X 133 slots, 4x SATA-300 [sata_sil24] each, RAID BIOS' disabled via jumper
Please see the recent post from Bernie Innocenti "sata_mv 0000:03:06.0: PCI ERROR; PCI IRQ cause=0x30000040". I suspect you could have a similar problem with the system running at too high a PCI-X bus clock speed with two cards installed.
3x RaidSonic IB-554SSK "backplane" (datasheet: http://www.raidsonic.de/de/data/data_pdf/icybox/datasheet_ib-555_554_553_d.pdf)
Debian stable (lenny)

backplane 1 (onboard ahci controller):
- 2x WD5000YS in raid1 = sd[ab][12]
- 2x WD10EADS in raid1 = sd[cd]1
These work flawlessly.

backplane 2 (sata_sil24 on PCI-X 133):
- 4x WD1000FYPS = sd[efgh]1

backplane 3 (sata_sil24 on PCI-X 133):
- 3x WD1002FBYS = sd[ijk]
- 1x empty tray

Tried temporarily powering half the disks via a second power supply, tried exchanging the power supply, tried switching around cables to maybe isolate a culprit backplane or controller. That last one actually looked promising for a while, but testing wasn't conclusive.

Tried a bunch of different kernels: 2.6.26-19 (lenny), 2.6.30-7 (lenny-backports) and 2.6.32-rc3 (vanilla). Not much difference.

Unfortunately just scrapping the box isn't an option as this is a personal project and the budget's just too tight ATM.

Any pointers on what I could try to narrow down which component is failing or what's going on in general?

Also see the attached dmesg, although that's NOT from the same boot as the syslog snippet.

Thank you,
Christian
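
One possible way to narrow things down is to tally the errors per ATA port across several test runs and see whether they cluster on one controller. The following is only a rough sketch: it assumes the kernel log lines look like the ones quoted above, and the function name tally_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 632.820017] ata13.00: exception Emask 0x20 SAct 0xffff SErr 0x0 action 0x6 frozen
EXCEPTION_RE = re.compile(r"\b(ata\d+)\.\d+: exception Emask ")
# Matches the controller-level message quoted in the original report.
PCI_FAULT_RE = re.compile(r"sata_sil24: IRQ status == 0xffffffff")


def tally_errors(lines):
    """Count exception events per ATA port and sata_sil24 PCI fault messages.

    Hypothetical helper; it only recognises the two message shapes above.
    """
    per_port = Counter()
    pci_faults = 0
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            per_port[match.group(1)] += 1
        elif PCI_FAULT_RE.search(line):
            pci_faults += 1
    return per_port, pci_faults
```

Fed the lines of a saved kernel log (for example via open(path) on whatever file holds it), this returns a per-port count plus the number of PCI fault messages. Which ataN belongs to which card still has to be read off the boot-time dmesg; the sketch does not attempt that mapping.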