[632714.975844] ata1.00: status: { DRDY ERR }
[632714.975847] ata1.00: error: { IDNF }
IDNF - ID not found, usually these days means 'asked for a sector out of range'
Justin, was this error provoked, for example, by you using a 'dd' command to try to read past the end of the disk? The md device driver should not do anything like that: it knows how many LBAs are available on the disk and should never issue a read or write past the end of the device. Justin, I have the impression that this IDNF error was NOT due to you issuing dd commands or similar, but was part of 'normal' md operation. Alan, am I overlooking something? (One feature that might produce such errors is the 'Host Protected Area' set of commands, which allow you to 'clip' the disk capacity.)
Most of the errors were during normal md/Linux operation. However, I will also note that with a 16-port 3ware controller the drives were also having problems; different errors, mind you, but they were rather nasty. The errors occurred on RAID-1, RAID-5 and RAID-6 and even on a single disk without md/Linux: I kept running bonnie++ with different filesystems on one disk, ext3 and XFS, both of which started erroring out.
When you say 'the errors occurred', do you mean specifically the 'IDNF' errors? Those are the ones I am talking about. I am NOT talking about other types of errors like UNC or ABORT. I am ONLY talking about the IDNF errors.
I removed all of the Velociraptors and replaced them with good old Raptor150s and tested with md/Linux and the 3ware card: no problems with either configuration. I truly believe something is wrong with the Velociraptors at this point, since both md/Linux and 3ware had problems with the drives.
It is easy to check if the IDNF errors are from the disk firmware or the Linux kernel.
The kernel error message for IDNF should report the LBA which provoked that IDNF error. Please compare that with the maximum LBA on the disk, which you can get either via hdparm or smartctl -i (divide the total capacity in bytes by 512 to get the number of sectors; the maximum LBA is one less than that, since LBAs start at 0).
If the LBA reported in the kernel error message is greater than the maximum LBA on the disk, then the problem is in the Linux kernel. If the LBA reported by the kernel is less than the maximum LBA on the disk, then the problem is the disk firmware. (Or perhaps the disk has its capacity 'clipped' with the 'Host Protected Area' feature set. But the Linux kernel should know about this, so that would be a kernel bug again, not a disk firmware bug.)
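To make the comparison concrete, here is a rough Python sketch of the check described above. The function names and the example capacity are made up for illustration and are not taken from any tool; it assumes 512-byte logical sectors, and you would substitute the capacity reported by smartctl -i and the LBA from your own kernel log.

```python
SECTOR_SIZE = 512  # bytes per logical sector (assumption, as in the text above)


def max_lba(capacity_bytes: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the highest valid LBA; sectors are numbered starting at 0."""
    return capacity_bytes // sector_size - 1


def classify_idnf(failed_lba: int, capacity_bytes: int) -> str:
    """Suggest where to look, given the LBA from the kernel's IDNF message.

    'kernel'          -> the request was past the end of the disk
    'firmware-or-hpa' -> the LBA was in range, so suspect the disk firmware
                         (or a capacity clipped by a Host Protected Area)
    """
    if failed_lba > max_lba(capacity_bytes):
        return "kernel"
    return "firmware-or-hpa"


if __name__ == "__main__":
    # Hypothetical disk of exactly 1,000,000 sectors (512,000,000 bytes).
    capacity = 512_000_000
    print(max_lba(capacity))
    print(classify_idnf(1_000_000, capacity))  # one past the last sector
    print(classify_idnf(123_456, capacity))    # well inside the disk
```

Note that an LBA exactly equal to the maximum is still a valid sector, so the sketch treats it as in range.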
Cheers, Bruce