On 12/03/2012 03:53 PM, Ric Wheeler wrote:
> On 12/03/2012 04:08 PM, Chris Friesen wrote:
>> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>>
>>> I jumped into this thread late - can you repost detail on the specific
>>> drive and HBA used here? In any case, it sounds like this is a better
>>> topic for the linux-scsi or linux-ide list where most of the low level
>>> storage people lurk :)
>> Okay, expanding the receiver list. :)
>>
>> To recap:
>>
>> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
>> Disks are WD9001BKHG, controller is Intel C600.
>>
>> Recently we started seeing messages of the following pattern, and we
>> don't know what's causing them:
>>
>> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
>> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
>> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
>> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>>
>> We've been assuming it's a software issue since it's reproducible on
>> multiple systems, although so far we've only seen the problem with
>> these particular disks.
>>
>> We've seen the problems with disk write cache enabled and disabled.
>
> Hi Chris,
>
> Are there any earlier IO errors or sda related errors in the log?

Nope, at least not nearby. On one system, for instance, we boot up and
get into steady-state, then there are no kernel logs for about half an
hour, then out of the blue we see:

Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sda, sector 1758169523
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Disk failure on sda2, disabling device.
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: raid1: Operation continuing on 1 devices.
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: end_request: I/O error, dev sdb, sector 1758169523
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: md: super_written gets error=-5, uptodate=0
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 0, wo:1, o:0, dev:sda2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: RAID1 conf printout:
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: --- wd:1 rd:2
Nov 27 14:58:13 base0-0-0-13-0-11-1 kernel: disk 1, wo:0, o:1, dev:sdb2

As another data point, it looks like we may be doing a SEND DIAGNOSTIC
command specifying the default self-test in addition to the background
short self-test. This seems a bit risky and excessive to me, but
apparently the guy that wrote it is no longer with the company.

What is the recommended method for monitoring disks on a system that is
likely to go a long time between boots? Do we avoid any in-service
testing and just monitor the SMART data and only test it if something
actually goes wrong? Or should we intentionally drop a disk out of the
array and test it? (The downside of that is that we lose redundancy
since we only have 2 disks.)

Chris