Hi all,

Hopefully someone here will know what's up with my machine. It's an nforce4 ultra box that's running a 10-drive RAID5 array. I upgraded from 2.6.17-rc4 to 2.6.18.3 about a week ago, and I've since had 3 drives kicked out. Previously, I had no kicks over almost a year. The kernel message is:

    ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata7.00: (BMDMA stat 0x20)
    ata7.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x41 err 0x4 (device error)
    ata7: EH complete
    SCSI device sdc: 488397168 512-byte hdwr sectors (250059 MB)
    sdc: Write Protect is off
    sdc: Mode Sense: 00 3a 00 00
    SCSI device sdc: drive cache: write back
    ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
    ata7.00: (BMDMA stat 0x20)
    ata7.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
    ata7: port is slow to respond, please be patient
    ata7: port failed to respond (30 secs)
    ata7: soft resetting port
    ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata7.00: failed to IDENTIFY (I/O error, err_mask=0x2)
    ata7.00: revalidation failed (errno=-5)
    ata7: failed to recover some devices, retrying in 5 secs
    ata7: hard resetting port
    ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata7.00: failed to IDENTIFY (I/O error, err_mask=0x2)
    ata7.00: revalidation failed (errno=-5)
    ata7: failed to recover some devices, retrying in 5 secs
    ata7: hard resetting port
    ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata7.00: failed to IDENTIFY (I/O error, err_mask=0x2)
    ata7.00: revalidation failed (errno=-5)
    ata7.00: disabled
    ata7: EH complete

First I thought it was a cabling or card issue, because the same drive got kicked twice. That drive was connected to a 2-port SIG sata_sil24 card. However, I just had another drive kicked that's connected to the onboard sata_nv, which leads me to suspect that the upgraded kernel might have something to do with it. A quick googling seems to indicate that others are seeing this with 2.6.18, too, so I was wondering if anyone knows more.
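To compare ports (and so controllers), it may help to tally the libata result lines per port. Below is a rough Python sketch, not a tested tool. The helper name `tally_ata_errors` and the regular expression are my own assumptions, based only on the two "tag ... cmd ..." lines quoted above; other kernels may format these lines differently.

```python
import re
from collections import Counter

# Assumed format, taken from the two result lines in the log above, e.g.:
#   ata7.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x41 err 0x4 (device error)
RESULT_LINE = re.compile(
    r"^(?P<port>ata\d+)\.\d+: tag \d+ cmd 0x[0-9a-f]+ .*\((?P<kind>[^)]+)\)\s*$"
)

def tally_ata_errors(lines):
    """Count result lines per (port, error kind); hypothetical helper."""
    counts = Counter()
    for line in lines:
        match = RESULT_LINE.match(line.strip())
        if match:
            counts[(match.group("port"), match.group("kind"))] += 1
    return counts

# Excerpt of the log from this message, used as sample input.
SAMPLE = """\
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata7.00: (BMDMA stat 0x20)
ata7.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x41 err 0x4 (device error)
ata7: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata7.00: (BMDMA stat 0x20)
ata7.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
ata7.00: failed to IDENTIFY (I/O error, err_mask=0x2)
""".splitlines()

if __name__ == "__main__":
    for (port, kind), n in sorted(tally_ata_errors(SAMPLE).items()):
        print(port, kind, n)
```

Running this over the full kernel log for each kicked drive would show whether the errors cluster on one port or spread across the sata_sil24 and sata_nv ports. Which ataN belongs to which controller is not in the lines above and has to be read from the boot messages.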

The drives contain science data for analysis, so it would be a pain (though not a disaster) to lose it. Would it be advisable to revert to the 2.6.17 kernel I was running before, or is this a problem that's fixed in a later kernel than the one I'm running now?

At the same time as the kernel upgrade, I also installed an Areca ARC1260 controller and connected a bunch of drives to it, so another idea I had was cable interference or something similar (there are now 18 drives in the machine).

Any ideas or thoughts would be appreciated,

/Patrik