Re: getting I/O errors in super_written()...any ideas what would cause this?

On 12/03/2012 03:53 PM, Ric Wheeler wrote:
> On 12/03/2012 04:08 PM, Chris Friesen wrote:
>> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>>
>>> I jumped into this thread late - can you repost the details on the
>>> specific drive and HBA used here? In any case, it sounds like this is
>>> a better topic for the linux-scsi or linux-ide list, where most of the
>>> low-level storage people lurk :)
>> Okay, expanding the recipient list. :)
>>
>> To recap:
>>
>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS
>> disks.  The disks are WD9001BKHG and the controller is an Intel C600.
>>
>> Recently we started seeing messages of the following pattern, and we
>> don't know what's causing them:
>>
>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>>
>> We've been assuming it's a software issue since it's reproducible on
>> multiple systems, although so far we've only seen the problem with
>> these particular disks.
>>
>> We've seen the problems with disk write cache enabled and disabled.
> 
> Hi Chris,
> 
> Are there any earlier I/O errors or sda-related errors in the log?

Nope, at least not nearby.  On one system, for instance, we boot up and
reach steady state; then there are no kernel logs for about half an
hour, and then out of the blue we see:

Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Disk failure on sda2, disabling device.
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Operation continuing on 1 devices.
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sdb, sector 1758169523
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 0, wo:1, o:0, dev:sda2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2
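
(For reference, error=-5 is -EIO.  The message comes from the
superblock-write completion callback in drivers/md/md.c, which fails
the member device on any superblock write error.  Roughly, from a
2.6.27-era tree; this is paraphrased from memory, so check your own
source:

static void super_written(struct bio *bio, int error)
{
	mdk_rdev_t *rdev = bio->bi_private;
	mddev_t *mddev = rdev->mddev;

	if (error || !test_bit(BIO_UPTODATE, &bio->bi_flags)) {
		printk("md: super_written gets error=%d, uptodate=%d\n",
		       error, test_bit(BIO_UPTODATE, &bio->bi_flags));
		WARN_ON(test_bit(BIO_UPTODATE, &bio->bi_flags));
		/* fail this member; raid1 then logs "Disk failure
		 * on ..." and drops it from the array */
		md_error(mddev, rdev);
	}

	if (atomic_dec_and_test(&mddev->pending_writes))
		wake_up(&mddev->sb_wait);
	bio_put(bio);
}

So md is just reporting an -EIO handed up from below; the interesting
question is why the SCSI side completed the write with an error in the
first place.)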


As another data point, it looks like we may be issuing a SEND DIAGNOSTIC
command specifying the default self-test in addition to the background
short self-test.  That seems a bit risky and excessive to me, but
apparently the person who wrote that code is no longer with the company.
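
(If I'm reading the sg_senddiag(8) man page right, the two tests map to
roughly the following; the device name is just an example:

	# default self-test: SEND DIAGNOSTIC with the SelfTest bit set
	sg_senddiag --test /dev/sda

	# background short self-test: self-test code 1
	sg_senddiag --selftest=1 /dev/sda

The default self-test runs in the foreground, so issuing it against a
disk that is actively serving RAID I/O is part of what worries me.)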

What is the recommended method for monitoring disks on a system that is
likely to go a long time between boots?  Should we avoid in-service
testing entirely, monitor the SMART data, and only run tests if
something actually goes wrong?  Or should we intentionally drop a disk
out of the array and test it?  (The downside of that is that we lose
redundancy, since we only have two disks.)
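
(For the SMART-only approach, the sort of thing I had in mind is a
smartd.conf entry along these lines; untested, and the schedule is just
an example:

	# monitor health/attributes, background short self-test daily
	/dev/sda -a -s S/../.././02
	/dev/sdb -a -s S/../.././03

i.e. stick to background short tests on a staggered schedule so both
halves of the mirror are never testing at once.)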

Chris