Re: libata error/reset

Tejun Heo <tj@xxxxxxxxxx> · Tue, 09 Sep 2008 13:28:36 +0200

Dan Noé wrote:
> Just after midnight last night, during an rsync job which copies a lot
> of data onto my backup disk (half of a Linux software RAID 1), I
> received the following:
> 
> -- SNIP --
> ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> ata3.00: irq_stat 0x00400000, PHY RDY changed
> ata3: SError: { PHYRdyChg }
> ata3.00: cmd 35/00:10:3f:00:34/00:00:22:00:00/e0 tag 0 dma 8192 out
>          res 50/00:00:4e:01:18/00:00:22:00:00/e0 Emask 0x10 (ATA bus error)
> ata3.00: status: { DRDY }
> ata3: hard resetting link
> ata3: link is slow to respond, please be patient (ready=0)
> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> ata3.00: qc timeout (cmd 0xec)
> ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
> ata3.00: revalidation failed (errno=-5)
> ata3: failed to recover some devices, retrying in 5 secs
> ata3: hard resetting link
> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> ata3.00: configured for UDMA/100
> ata3: EH complete
> sd 2:0:0:0: [sdc] 625142448 512-byte hardware sectors (320073 MB)
> sd 2:0:0:0: [sdc] Write Protect is off
> sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> -- SNIP --
> 
> The system seems to be working fine now, and there was not even a RAID
> failure reported by the md system.  Is this something I should be
> concerned about? Hardware issue, software bug?
> 
> Linux colobus 2.6.26.3 #1 SMP Thu Aug 21 10:15:38 EDT 2008 i686 Intel(R)
> Pentium(R) 4 CPU 3.20GHz GenuineIntel GNU/Linux
> 
> 00:1f.2 SATA controller: Intel Corporation 82801FR/FRW (ICH6R/ICH6RW)
> SATA Controller (rev 03)
> 
> I am using the libata ahci driver.  There are four drives crammed into a
> 1U with hotplug trays, but AFAIK no one was poking around the system.

Transmission errors do occur occasionally on perfectly healthy
machines, so if it doesn't happen regularly, you can just ignore it and
the kernel will do the right thing.  However, there have been a
non-negligible number of cases where a poor power supply fails to
maintain voltage under high IO load and makes the hard drive go offline
briefly, which would also show up as PHYRdyChg.  In these cases, you
can usually hear the drive doing an emergency unload (clicking), and
smartctl -a is likely to show increased values for start/stop count
and/or emergency unload count.  When that happens, the drive loses the
data in its buffer and the filesystem gets corrupted, so you really
should get a better power supply.
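
If you want to check those counters without eyeballing the table, a
rough sketch along these lines might help.  It assumes the usual
smartctl -a attribute table layout, and that the counters of interest
are attribute 4 (Start_Stop_Count) and attribute 192 (often reported as
Power-Off_Retract_Count); attribute names and IDs vary between drive
vendors, so treat both as assumptions.  The function names are made up
for illustration.

```python
# Sketch only: compare two saved "smartctl -a" outputs and report which
# watched counters went up.  IDs 4 and 192 are assumptions; check what
# your drive actually reports.
WATCHED_IDS = {4, 192}


def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} for the watched IDs."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED_IDS and fields[9].isdigit():
            found[attr_id] = (fields[1], int(fields[9]))
    return found


def increased_counters(before_text, after_text):
    """Return {attribute_id: (name, delta)} for counters that grew."""
    before = parse_smart_attributes(before_text)
    after = parse_smart_attributes(after_text)
    deltas = {}
    for attr_id, (name, raw_after) in after.items():
        if attr_id in before and raw_after > before[attr_id][1]:
            deltas[attr_id] = (name, raw_after - before[attr_id][1])
    return deltas
```

Save the output of "smartctl -a /dev/sdc" before and after a heavy IO
run (the rsync job, for instance) and feed both to
increased_counters().  An empty result after several runs that hit the
error would point away from the power supply theory.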

Thanks.

-- 
tejun