Peter Kjellström wrote:
> On Wednesday 07 March 2012 11.17.15 m.roth@xxxxxxxxx wrote:
>> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a
>> 3tb drive in for additional workspace for the users, and some of them
>> won't read, others will go for weeks, then spit out DRDY errors. lshw
>> shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
> ...
>> Now, I've been working on one with Penguin. I noticed one thing, that it
>> was set to native IDE. After googling, I saw that the most recent spec,
>> which included EIDE, should be good to petabytes... but I tried
>> resetting it to AHCI anyway.
>>
>> The user ran one job, ok... then another last night, and it's spitting
>> the same errors.
> ...
>> Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
> ...
>> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> ...
>> Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
>
> While writing the drive timed out and the link to it was then subjected to
> a hard reset. This is not normal and usually points to bad drive or buggy
> firmware.
>
> Have you had a look at smartdata for the drive(s)? (you may want to run
> the smart selftests)
>
> Also, I'd suggest you test it in a controlled environment. For example,
> can any of your drives survive a full surface write? (dd if=/dev/zero
> bs=1M of=..)
> Full surface read? Do the tests against /dev/sdX to be sure (excludes
> partitioning, filesystems, volume management, etc.)
>
> Do note that writing your drive full of zeros _will_ destroy your data (I
> really hope that's stating the obvious...).

<g> Of course.

Nahhh... I've run bonnie++ against it, but couldn't provoke it. It's this
one user, who runs *large* jobs, with big o/p, when it hits.

smartctl - I ran the short test just before lunch, and smartctl -H reports
it passed, completed without errors.

I saw that it timed out.

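
For reference, the non-destructive half of the test Peter suggests (the full surface read) could be sketched roughly as below. This is only a sketch under assumptions: surface_read is a made-up helper name, /dev/sdX is a placeholder for the real drive, and GNU dd is assumed for the bs=1M syntax. It only reads from the device, so it does not touch the data.

```shell
#!/bin/sh
# Read-only surface scan: read every block of a device and discard the data.
# surface_read is a hypothetical helper name, not an existing tool.
surface_read() {
    dev=$1
    # bs=1M matches the block size in the quoted dd example (GNU dd syntax).
    # Output goes to /dev/null, so nothing is ever written to the device.
    # dd exits non-zero if the device cannot be opened or a read fails.
    dd if="$dev" of=/dev/null bs=1M 2>/dev/null
}

# Example usage (needs root; /dev/sdX is a placeholder):
#   surface_read /dev/sdX && echo "read pass completed"
#   dmesg | grep ata2               # check for new timeouts or link resets
#   smartctl -t long /dev/sdX       # start the extended SMART self-test
#   smartctl -l selftest /dev/sdX   # view the result once it has finished
```

If the read pass fails, or new "hard resetting link" lines show up in the kernel log while it runs, that would point at the drive, cable, or controller rather than at the filesystem or the user's job.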
One of the reasons for some of the stuff I included, above, was that
kernel: ata2.00: device reported invalid CHS sector 0

Also, I noticed that lshw showed the ATI controller having a width of 32
bits, and a clock of 66MHz, and wondered if there could be some sort of
slip-through-the-cracks where the driver didn't handle this correctly.

        mark

_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos