Tom,

I did not read your e-mail in full, but using lots of SATA drives in a
big RAID array is not something I would attempt with 2.6.17 or older
kernels. (I know 2.6.17 is not even out yet.)

In 2.6.17-mm there is a huge SATA error handler (EH) rewrite. It is
planned to hit the stable Linus kernel with 2.6.18 towards the end of
the summer, but even then only a few of the actual drivers will have
been modified to use the EH infrastructure.

I would repost your problem to the linux-ide list and see if they think
that the new EH should help you, and when/if your controller will be
using the new EH infrastructure.

FYI: the address is linux-ide@xxxxxxxxxxxxxxx. SATA is discussed there,
and there is no need to subscribe; they will cc you on responses.

Also, there is a ton of testing going on with the new EH, so if you're
willing to be a guinea pig, I'm sure you will get a lot of support from
the dev team and get your specific driver updated ASAP.

HTH
Greg
--
Greg Freemyer

On 6/14/06, Tom Wirschell <Tom@xxxxxxxxxxxx> wrote:
On 14 Jun 2006, Rune Saetre wrote:
>
> I always thought the loud click came from the disks parking their
> heads before spinning down.

Well, it's most certainly loud. The same type of loud that you get when
the machine shuts down and removes the power from the drives. I thought
recalibration ticks weren't particularly loud.

> Anyway, it can take several seconds before a disk responds to
> commands after having spun down.

The problem isn't that it takes time to come back up after a spin down.
The drive isn't spinning down. It's turning itself off completely (note
the 'no device found' bit in the error). And it does this while it's
actively being used.

> On Wed, 14 Jun 2006, Molle Bestefich wrote:
> >
> > Does the drive's SMART log say anything interesting?

That's a damned good question. I didn't even know you could query that,
so I just recreated the array and started my test again. It took about
90 minutes for one of the drives to die. Unfortunately, once it dies it
refuses to respond to anything.

When I try the smartctl program on the failed drive I get:

  Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

When I issue the exact same command for another disk on the controller
I get the nice listing that you would expect from this program.

When I use hdparm -I on the dead drive I get:

  HDIO_DRIVE_CMD(identify) failed: Input/output error

And again, if I issue the exact same command for another disk on this
same controller I get a nice bit of info on the drive.

To me at least, this basically says that the drive is actually turned
off at this point in time. That would explain why SMART isn't getting
any data. On the other hand, it doesn't explain *WHY* the drive is off.
Do you know of any program that's capable of telling a drive that isn't
on to activate itself? I don't think it's even possible, but I might be
mistaken there.
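The manual smartctl check described above could be scripted so that the
moment a drive stops answering gets a timestamp. What follows is only a
minimal sketch, not a tested tool: the function names drive_responds,
probe and watch are hypothetical, the 30-second interval is arbitrary,
and depending on the kernel and smartmontools version extra options
(passed here through extra_args) may be needed to reach SATA drives. It
relies on two things: the error text quoted above, and the smartctl man
page's statement that bit 1 of the exit status is set when the device
cannot be opened or returns no IDENTIFY DEVICE data.

```python
import subprocess
import time

# Error text quoted earlier in this message for the drive that dropped out.
FAILURE_MARKER = "Device Read Identity Failed"


def drive_responds(returncode, output):
    """Decide from one smartctl run whether the drive answered IDENTIFY."""
    # Bit 1 (value 2) of smartctl's exit status: device open failed or
    # the device returned no IDENTIFY DEVICE structure.
    if returncode & 2:
        return False
    return FAILURE_MARKER not in output


def probe(device, extra_args=()):
    """Run 'smartctl -i' once against device and classify the result."""
    cmd = ["smartctl", "-i", *extra_args, device]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return drive_responds(result.returncode, result.stdout + result.stderr)


def watch(devices, interval=30.0, extra_args=()):
    """Poll every device until each one has stopped responding."""
    alive = set(devices)
    while alive:
        for dev in sorted(alive):
            if not probe(dev, extra_args):
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                print(f"{stamp} {dev} stopped answering IDENTIFY")
                alive.discard(dev)
        time.sleep(interval)
```

Calling watch() with the array members' device nodes while the test
runs would print one line per drive that drops out. This only records
when a drive disappears; it does not explain why, and it cannot bring
the drive back.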
So, I reboot, run smartctl again and I'm presented with a nice sheet of
output that basically says all is well, nothing ever went wrong with
this drive and you can feel safe in using it. This royally sucks...

> > Have you tried poking the IDE driver to reset the bus, might get it
> > running again?

How would I do this? I've compiled the driver into the kernel. But if
SMART data is kept even when a drive is off, this won't fix anything.

> > Not a very pretty solution, especially since you might still suffer
> > two drives going down at once from time to time. Maybe you can
> > patch MD to pause the array and poke the IDE driver whenever a disk
> > is lost? Then you would at least only have intermittent failures /
> > timeouts on a rare basis rather than a non-redundant array when
> > something happens.

The problem is that I can't tell if it's really MD that is telling the
drive to turn itself off. Is there even code in MD that does this?
Shouldn't it complain VERY LOUDLY that it's unhappy with a drive and
thus decide to kill it?

> > If the disk never comes up, being patient surely won't help.
> > Wait for an hour and see if the drive comes up, ask the WD folks
> > exactly how patient they want you to be? :-)

The assumption was that since the drive took so long to respond, MD is
telling the drive "You know what, fuck it. Never mind those outstanding
requests, just shut down and let the rest of us get on with business",
only thereby killing the array.

> > bonnie++ does random seeks, right?

I think so, yeah.

Kind regards,

Tom Wirschell

--
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century