Sorry for such a slow reply.

| Date: Tue, 28 Apr 2009 17:59:50 -0700
| From: Dave Stevens <geek@xxxxxxxxxxxx>
| Subject: Re: Seagate disk problems (NCQ bug???)
|
| Quoting "Wolfgang S. Rupprecht" <wolfgang.rupprecht+gnus200904@xxxxxxxxx>:
|
| >
| >After running flawlessly for 6+ months I just had my Seagate
| >ST31500343AS (w. SD35 firmware) flake out. Does this look like the NCQ
| >bug or just a random event? The final error msg was around the time the
| >machine hung hard.
|
| There is a specific test you can download from Seagate and burn to a bootable
| cd. The test on the cd will tell you if it is the ncq bug. They are offering
| data recovery if it is indeed a blown disk, they're treating it as a warranty
| issue.

Can you give us a pointer to official and unofficial information about
the NCQ bug?

Seagate had a bug in firmware for 7200.11 drives. They publicly
disclosed a bit about the problem and offered a firmware upgrade near
the end of January 2009. If the bug tripped, the drive would lock up
and could not be fixed in place. See
http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=11972&view=by_date_ascending&page=1

That firmware fix has left a lot of complaining users. That forum
thread has 794 messages currently! I've read all of them and cannot
really see a pattern for the remaining problems.

I started this thread to try to get more coherent reports but it
hasn't worked.
http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=11184

A number of reports appear to be cases of drives going "offline" for
no reported reason. One symptom is drives falling out of RAID arrays.
These drives come back after a power cycle.

Perhaps your problem is like this one. And I have no idea if NCQ is
implicated.

But once your drive gets in a bad state, the driver tries a reset and
still isn't happy.
I'd be surprised if NCQ is used between the reset and the subsequent
failure.

| >Apr 28 04:26:29 arbol kernel: ata1: SATA max UDMA/133 irq_stat 0x00400000,
| >PHY RDY changed irq 22
| >Apr 28 04:26:29 arbol kernel: ata1: softreset failed (device not ready)
| >Apr 28 04:26:29 arbol kernel: ata1: failed due to HW bug, retry pmp=0
| >Apr 28 04:26:29 arbol kernel: ata1: SATA link up 3.0 Gbps (SStatus 123
| >SControl 300)
| >Apr 28 04:26:29 arbol kernel: ata1.00: ATA-8: ST31500343AS, SD35, max
| >UDMA/133
| >Apr 28 04:26:29 arbol kernel: ata1.00: 2930277168 sectors, multi 16: LBA48
| >NCQ (depth 31/32)
| >Apr 28 04:26:29 arbol kernel: ata1.00: configured for UDMA/133

Time passes. Happily, I assume.

| >Apr 28 06:17:02 arbol kernel: ata1.00: exception Emask 0x50 SAct 0x1 SErr
| >0x90a02 action 0xe frozen
| >Apr 28 06:17:02 arbol kernel: ata1.00: irq_stat 0x00400000, PHY RDY changed
| >Apr 28 06:17:02 arbol kernel: ata1: SError: { RecovComm Persist HostInt
| >PHYRdyChg 10B8B }
| >Apr 28 06:17:02 arbol kernel: ata1.00: cmd
| >60/08:00:e1:81:24/00:00:74:00:00/40 tag 0 ncq 4096 in
| >Apr 28 06:17:02 arbol kernel: res 40/00:00:e1:81:24/00:00:74:00:00/40
| >Emask 0x50 (ATA bus error)
| >Apr 28 06:17:02 arbol kernel: ata1.00: status: { DRDY }

Something has gone wrong (duh!), but I don't know enough to say what.

| >Apr 28 06:17:02 arbol kernel: ata1: hard resetting link

Here's a reset. I bet NCQ will not be used for a while (until the
drive appears to be up again after the reset).

| >Apr 28 06:17:04 arbol kernel: ata1: SATA link up 3.0 Gbps (SStatus 123
| >SControl 300)
| >Apr 28 06:17:09 arbol kernel: ata1.00: qc timeout (cmd 0xec)
| >Apr 28 06:17:09 arbol kernel: ata1.00: failed to IDENTIFY (I/O error,
| >err_mask=0x4)

IDENTIFY is a command that asks the drive about its characteristics.
I would be astonished if a driver were using NCQ at this point.
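
The opcodes in the log give one way to check the guess about NCQ. Here is a minimal Python sketch, assuming the standard ATA opcode assignments (0x60 and 0x61 are the queued NCQ read and write, 0xEC is IDENTIFY DEVICE). The helper names opcode_from_log and is_ncq are made up for illustration, and the sketch expects each log message on one unwrapped line.

```python
import re

# ATA command opcodes relevant to this log (assumed from the ATA command set).
ATA_COMMANDS = {
    0x60: "READ FPDMA QUEUED",   # NCQ read
    0x61: "WRITE FPDMA QUEUED",  # NCQ write
    0xEC: "IDENTIFY DEVICE",     # plain, non-queued command
}
NCQ_OPCODES = {0x60, 0x61}


def opcode_from_log(line):
    """Return the command opcode found in a libata log line, or None."""
    # Form 1: "qc timeout (cmd 0xec)"
    m = re.search(r"\(cmd 0x([0-9a-fA-F]{2})\)", line)
    if m:
        return int(m.group(1), 16)
    # Form 2: "cmd 60/08:00:..." where the first byte is the opcode
    m = re.search(r"\bcmd\s+([0-9a-fA-F]{2})/", line)
    if m:
        return int(m.group(1), 16)
    return None


def is_ncq(opcode):
    """True if the opcode is one of the NCQ read/write commands."""
    return opcode in NCQ_OPCODES


if __name__ == "__main__":
    samples = [
        "ata1.00: cmd 60/08:00:e1:81:24/00:00:74:00:00/40 tag 0 ncq 4096 in",
        "ata1.00: qc timeout (cmd 0xec)",
    ]
    for line in samples:
        op = opcode_from_log(line)
        name = ATA_COMMANDS.get(op, "unknown")
        print(f"0x{op:02x} {name} ncq={is_ncq(op)}")
```

Under those assumptions, the command that was in flight at 06:17:02 was an NCQ read, while the command that timed out after the reset (0xec) was not a queued command.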
| >Apr 28 06:17:09 arbol kernel: ata1.00: revalidation failed (errno=-5)
| >Apr 28 06:17:09 arbol kernel: ata1: hard resetting link

Another reset.

| >Apr 28 06:17:11 arbol kernel: ata1: SATA link up 3.0 Gbps (SStatus 123
| >SControl 300)
| >Apr 28 06:17:21 arbol kernel: ata1.00: qc timeout (cmd 0xec)
| >Apr 28 06:17:21 arbol kernel: ata1.00: failed to IDENTIFY (I/O error,
| >err_mask=0x4)

Another failure. Again, I would not expect NCQ to be used at this
point.

And so it goes. I infer that this cycle goes on until the power is
turned off.
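
For anyone wanting to decode the SErr value from the exception line by hand, here is a rough Python sketch. The bit positions and short names are my recollection of the SATA SError register layout and of the labels the Linux libata error handler prints, so treat the table as an assumption rather than a reference. The function name decode_serror is made up for illustration.

```python
# SError bit positions with libata-style short names (assumed, not authoritative).
SERROR_BITS = [
    (0, "RecovData"),
    (1, "RecovComm"),
    (8, "UnrecovData"),
    (9, "Persist"),
    (10, "Proto"),
    (11, "HostInt"),
    (16, "PHYRdyChg"),
    (17, "PHYInt"),
    (18, "CommWake"),
    (19, "10B8B"),
    (20, "Dispar"),
    (21, "BadCRC"),
    (22, "Handshk"),
    (23, "LinkSeq"),
    (24, "TrStaTrns"),
    (25, "UnrecFIS"),
    (26, "DevExch"),
]


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in SERROR_BITS if value & (1 << bit)]


if __name__ == "__main__":
    # SErr value taken from the 06:17:02 exception line in the log.
    print(decode_serror(0x90A02))
```

With that table, 0x90a02 decodes to the same five flags the kernel printed in its "SError: { ... }" line. All of them are link-level conditions, which is consistent with the "PHY RDY changed" message, but it does not by itself say whether the drive or the cabling is at fault.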