Hello,
jon@xxxxxxxxxxxxxxxxxx wrote:
So far I've seen two sorts of errors. They both seem to be preceded
by a sort of "chirp" from the drive. The first case resulted in journal
failure and the affected partition being remounted R/O; the second
appeared to be more of a transient failure - after locking up the
machine for a minute, things resumed. The syslogs looked like this:
First error:
Sep 2 23:07:56 rocky kernel: ata1: command 0x25 timeout, stat 0x50
host_stat 0x1
Sep 2 23:07:56 rocky kernel: ata1: status=0x50 { DriveReady SeekComplete }
Sep 2 23:07:56 rocky kernel: ata1: error=0x01 { AddrMarkNotFound }
Sep 2 23:07:56 rocky kernel: sda: Current: sense key: No Sense
Sep 2 23:07:56 rocky kernel: Additional sense: No additional sense
information
Sep 2 23:07:56 rocky kernel: EXT3-fs error (device sda4):
ext3_free_blocks: Freeing blocks not in datazone - block = 1977993469,
count = 1
Sep 2 23:07:56 rocky kernel: Aborting journal on device sda4.
Sep 2 23:07:56 rocky kernel: ext3_abort called.
Sep 2 23:07:56 rocky kernel: EXT3-fs error (device sda4):
ext3_journal_start_sb: Detected aborted journal
Sep 2 23:07:56 rocky kernel: Remounting filesystem read-only
Sep 2 23:07:56 rocky kernel: EXT3-fs error (device sda4):
ext3_free_blocks: Freeing blocks not in datazone - block = 1499238360,
count = 1
Sep 2 23:07:56 rocky kernel: EXT3-fs error (device sda4):
ext3_free_blocks: Freeing blocks not in datazone - block = 1092876199,
count = 1
[... and many, many more of the last line - there were hundreds of
blocks recovered into lost+found after fsck, although their contents
may all have been from previously deleted files]
Second error:
Sep 3 00:02:18 rocky kernel: ata1: command 0xca timeout, stat 0x50
host_stat 0x1
Sep 3 00:02:18 rocky kernel: ata1: status=0x50 { DriveReady SeekComplete }
Sep 3 00:02:18 rocky kernel: ata1: error=0x01 { AddrMarkNotFound }
Sep 3 00:02:18 rocky kernel: sda: Current: sense key: No Sense
Sep 3 00:02:18 rocky kernel: Additional sense: No additional sense
information
Sep 3 00:02:18 rocky kernel: Info fld=0x1
Is either of these related to the "m15w" error? Or would you have
any other suggestions as to a known cause of the problem? I looked in
sata_sil.c, and the ST3400633AS is not on the blacklist in this kernel.
No, neither is related to m15w. It seems that your drive is failing some
commands with an ID not found error, which might be a media problem.
Anyway, libata is having problems recovering from the error condition
and retrying the command, thus the catastrophe.
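For reference, the register values in the logs above can be decoded by hand. What follows is only a minimal sketch in Python, not anything taken from libata itself: the bit positions follow the classic ATA status and error register layout, the names match the log output where the log shows them (the rest are illustrative labels), and the helper names decode_bits and describe_timeout are made up for this example. The opcode table covers only the two commands that appear in the logs.

```python
import re

# ATA status register bits (classic ATA layout); names are illustrative
# except where they match the log text above.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# ATA error register bits; again, names are illustrative labels.
ERROR_BITS = {
    0x80: "BadCRC",
    0x40: "UncorrectableError",
    0x20: "MediaChanged",
    0x10: "SectorIdNotFound",
    0x08: "MediaChangeRequested",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}

# Only the two opcodes seen in the logs; this is not a complete table.
COMMANDS = {0x25: "READ DMA EXT", 0xCA: "WRITE DMA"}

# Matches the "command 0x.. timeout, stat 0x.." part of the log line.
TIMEOUT_RE = re.compile(
    r"command (0x[0-9a-fA-F]+) timeout, stat (0x[0-9a-fA-F]+)"
)


def decode_bits(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


def describe_timeout(line):
    """Return (command name, status flags) for a timeout line, else None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    status = int(match.group(2), 16)
    return COMMANDS.get(opcode, "unknown"), decode_bits(status, STATUS_BITS)


if __name__ == "__main__":
    print(describe_timeout("ata1: command 0x25 timeout, stat 0x50"))
    print(describe_timeout("ata1: command 0xca timeout, stat 0x50"))
    print(decode_bits(0x01, ERROR_BITS))
```

Read this way, the first timeout was on a read (0x25) and the second on a write (0xca), both with the same status and error values.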
So far I've upgraded the BIOS to the latest from Abit, which
includes a more recent SATA BIOS from Silicon Image, and fiddled with
some of the BIOS settings - particularly changing Ext-P2P Discard from
30us to 1ms, as suggested in a much older NVIDIA/Abit bug dialogue. I
don't know if any of this is actually helping yet, though.
I'm skeptical.
Can you try 2.6.18-rc5? The latest libata has much improved error handling.
If the errors your drive is reporting are transient, the new libata EH
should be able to recover from most of them and, even if not, it will
help diagnose the problem.
Thanks.
--
tejun