Hi,

I have a similar problem with my two Sun T2000 machines. During the last week I got a degraded array twice. Each time a different disk was kicked out of the array. On the other T2000 machine the same has happened multiple times in the past too. The interesting part is that it is always the same sector involved on every disk, as in the original report. After a manual resync of the disks it seems to work for some time until it fails again. SMART doesn't show any errors on the disks.

[871180.857895] sd 0:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
[871180.857929] end_request: I/O error, dev sda, sector 143363852
[871180.857950] md: super_written gets error=-5, uptodate=0
[871180.857968] raid1: Disk failure on sda2, disabling device.
[871180.857976] Operation continuing on 1 devices
[871180.863652] RAID1 conf printout:
[871180.863678] --- wd:1 rd:2
[871180.863694] disk 0, wo:1, o:0, dev:sda2
[871180.863710] disk 1, wo:0, o:1, dev:sdb2
[871180.873021] RAID1 conf printout:
[871180.873041] --- wd:1 rd:2
[871180.873053] disk 1, wo:0, o:1, dev:sdb2
[925797.120488] md: data-check of RAID array md0
[925797.120516] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[925797.120531] md: using maximum available idle IO bandwidth (but not more than 30000 KB/sec) for data-check.
[925797.120573] md: using 256k window, over a total of 71585536 blocks.
[925797.121308] md: md0: data-check done.
[925797.137397] RAID1 conf printout:
[925797.137419] --- wd:1 rd:2
[925797.137433] disk 1, wo:0, o:1, dev:sdb2
[1036034.437130] md: unbind<sda2>
[1036034.437168] md: export_rdev(sda2)
[1036044.572402] md: bind<sda2>
[1036044.574923] RAID1 conf printout:
[1036044.574945] --- wd:1 rd:2
[1036044.574960] disk 0, wo:1, o:1, dev:sda2
[1036044.574976] disk 1, wo:0, o:1, dev:sdb2
[1036044.575157] md: recovery of RAID array md0
[1036044.575171] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[1036044.575186] md: using maximum available idle IO bandwidth (but not more than 30000 KB/sec) for recovery.
[1036044.575227] md: using 256k window, over a total of 71585536 blocks.
[1038465.450853] md: md0: recovery done.
[1038465.549707] RAID1 conf printout:
[1038465.549728] --- wd:2 rd:2
[1038465.549743] disk 0, wo:0, o:1, dev:sda2
[1038465.549759] disk 1, wo:0, o:1, dev:sdb2
[1192672.830876] sd 0:0:1:0: [sdb] Result: hostbyte=0x01 driverbyte=0x00
[1192672.830910] end_request: I/O error, dev sdb, sector 143363852
[1192672.830932] md: super_written gets error=-5, uptodate=0
[1192672.830951] raid1: Disk failure on sdb2, disabling device.
[1192672.830958] Operation continuing on 1 devices
[1192672.836943] RAID1 conf printout:
[1192672.836964] --- wd:1 rd:2
[1192672.836976] disk 0, wo:0, o:1, dev:sda2
[1192672.836990] disk 1, wo:1, o:0, dev:sdb2
[1192672.846157] RAID1 conf printout:
[1192672.846177] --- wd:1 rd:2
[1192672.846189] disk 0, wo:0, o:1, dev:sda2

The disks used are:

Device: FUJITSU MAY2073RCSUN72G Version: 0401
Device type: disk
Transport protocol: SAS
Local Time is: Mon Sep 14 07:24:28 2009 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
Current Drive Temperature: 34 C
Drive Trip Temperature: 65 C
Manufactured in week 38 of year 2006
Recommended maximum start stop count: 10000 times
Current start stop count: 56 times
Elements in grown defect list: 0

Device: FUJITSU MAY2073RCSUN72G Version: 0401
Device type: disk
Transport protocol: SAS
Local Time is: Mon Sep 14 07:25:49 2009 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
Current Drive Temperature: 33 C
Drive Trip Temperature: 65 C
Manufactured in week 38 of year 2006
Recommended maximum start stop count: 10000 times
Current start stop count: 56 times
Elements in grown defect list: 0

Controller:

0000:07:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064ET PCI-Express Fusion-MPT SAS (rev 02)

Thanks
Marc

On Tue, 01 Sep 2009 10:18:06 -0400 Andrei Tanas <andrei@xxxxxxxx>
wrote:

> On Tue, 01 Sep 2009 09:47:31 -0400, Ric Wheeler <rwheeler@xxxxxxxxxx>
> wrote:
> >>>> Mine errored out again with exactly the same symptoms, this time after
> >>>> only few days and with the "tunable" set to 2 sec. I got a warranty
> >>>> replacement but haven't shipped this one yet. Let me know if you want it.
> >>> ..
> >>>
> >>> Not me. But perhaps Tejun ?
> >>
> >> I think you're much more qualified than me on the subject. :-)
> >>
> >> Anyone else? Ric, are you interested with playing the drive?
> >
> > No thanks....
> >
> > I would suggest that Andrei install the new drive and watch it for a few
> > days to make sure that it does not fail in the same way. If it does, you
> > might want to look at the power supply/cables/etc?
>
> The drive is the second member of RAID1 array, as far as I understand, both
> drives should be experiencing very similar access patterns, and they are
> the same model with the same firmware, and manufactured on the same day,
> but only one of them showed these symptoms, so there must be something
> "special" about it.
> By now I think that MD made the right "decision" failing the drive and
> removing it from the array, so I guess let's leave it at that.
>
> Andrei.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
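
For anyone who wants to check their own logs for the "same sector on every disk" pattern reported above, here is a minimal Python sketch. It is only an illustration, not something used in this thread: the names `IO_ERROR`, `errors_by_sector` and `SAMPLE` are made up for the example, and the regular expression assumes the `end_request: I/O error, dev ..., sector ...` wording seen in the log above (other kernel versions may word this line differently).

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [871180.857929] end_request: I/O error, dev sda, sector 143363852
# (format taken from the log in this message; an assumption for other kernels)
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def errors_by_sector(log_text):
    """Map each failing sector number to the set of devices that reported it."""
    sectors = defaultdict(set)
    for dev, sector in IO_ERROR.findall(log_text):
        sectors[int(sector)].add(dev)
    return dict(sectors)


# The two I/O error lines quoted in the log above.
SAMPLE = """\
[871180.857929] end_request: I/O error, dev sda, sector 143363852
[1192672.830910] end_request: I/O error, dev sdb, sector 143363852
"""

if __name__ == "__main__":
    for sector, devs in errors_by_sector(SAMPLE).items():
        # A sector reported by more than one device is the pattern of interest.
        print(sector, sorted(devs))
```

Feeding it the full output of `dmesg` instead of `SAMPLE` should work the same way, as long as the error lines use the wording assumed above.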