On 12/03/2012 02:08 PM, Chris Friesen wrote:
> On 12/03/2012 02:52 PM, Ric Wheeler wrote:
>
>> I jumped into this thread late - can you repost detail on the specific
>> drive and HBA used here? In any case, it sounds like this is a better
>> topic for the linux-scsi or linux-ide list where most of the low level
>> storage people lurk :)
>
> Okay, expanding the receiver list. :)
>
> To recap:
>
> I'm running 2.6.27 with LVM over software RAID 1 over a pair of SAS disks.
> Disks are WD9001BKHG, controller is Intel C600.

Just curious, what driver are you using with the C600? The upstream driver
for the C600 wasn't accepted until 3.0-rc6, and the outstanding patches
weren't all accepted until 3.7-rc. So I'd say 3.6 would be your best bet
until 3.7 is released. Did you attempt a backport of the isci driver, or
are you using something like an LSI port on 2.6.27? Have you verified the
issue on a more recent kernel?

> Recently we started seeing messages of the following pattern, and we
> don't know what's causing them:
>
> Nov 28 08:57:10 kernel: end_request: I/O error, dev sda, sector 1758169523
> Nov 28 08:57:10 kernel: md: super_written gets error=-5, uptodate=0
> Nov 28 08:57:10 kernel: raid1: Disk failure on sda2, disabling device.
> Nov 28 08:57:10 kernel: raid1: Operation continuing on 1 devices.
>
> We've been assuming it's a software issue since it's reproducible on
> multiple systems, although so far we've only seen the problem with
> these particular disks.
>
> We've seen the problems with disk write cache enabled and disabled.
>
> It looks like it may be related to being in the middle of a background
> short self-test at the time we see the error. The disks are still
> in-service at this point--is this supported behaviour or would it
> be expected to cause errors? (The self-test works fine with other
> disks, and worked fine with these disks until recently, but we haven't
> made any changes to the block I/O code.)
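
If it helps to confirm the self-test correlation, here is a minimal sketch in Python of how the two pieces of evidence could be checked together. It is only an illustration, not something taken from your setup: the function names are made up, the regular expression is based solely on the end_request line quoted above, and the self-test check assumes the log entries look like the "# 1 Background short ..." rows that smartctl prints.

```python
import re

# Matches the kernel message quoted above, e.g.
# "end_request: I/O error, dev sda, sector 1758169523"
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def parse_io_errors(log_text):
    """Return a list of (device, sector) tuples found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in IO_ERROR_RE.finditer(log_text)]


def self_test_in_progress(smartctl_text):
    """Return True if any self-test log entry reports a test still running.

    Only lines starting with '#' are treated as self-test log entries
    (an assumption based on the smartctl output format in this thread).
    """
    for line in smartctl_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") and "Self test in progress" in stripped:
            return True
    return False
```

Feeding parse_io_errors() the kernel log and self_test_in_progress() the output of "smartctl --all /dev/sda" captured at the moment of each error would show whether every failure really does coincide with a running self-test, or only some of them.
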
>
> Here's the smartctl output from right after a failure. The self-tests
> are frequent as a stress-test, normally they're done once per day:
>
> root@typhoon-base-unit0:/root> ./smartctl --all /dev/sda
> smartctl version 5.38 [i686-wrs-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> Device: WD WD9001BKHG-02D22 Version: SR03
> Serial number: WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:35:03 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature: 39 C
> Drive Trip Temperature: 69 C
> Manufactured in week 01 of year 2010
> Recommended maximum start stop count: 1048576 times
> Current start stop count: 26 times
> Elements in grown defect list: 0
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21187        2         2     21189          2       4950.446           0
> write:        89        4         0        93          4       1317.938           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count: 169436
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background short  Self test in progress ...   4     NOW                 - [-   -    -]
> # 2  Background short  Completed                   -    1377                 - [-   -    -]
> # 3  Background short  Completed                   -    1377                 - [-   -    -]
> # 4  Background short  Completed                   -    1377                 - [-   -    -]
> # 5  Background short  Completed                   -    1377                 - [-   -    -]
> # 6  Background short  Completed                   -    1377                 - [-   -    -]
> # 7  Background short  Completed                   -    1377                 - [-   -    -]
> # 8  Background short  Completed                   -    1377                 - [-   -    -]
> # 9  Background short  Completed                   -    1377                 - [-   -    -]
> #10  Background short  Completed                   -    1377                 - [-   -    -]
> #11  Background short  Completed                   -    1377                 - [-   -    -]
> #12  Background short  Completed                   -    1377                 - [-   -    -]
> #13  Background short  Completed                   -    1377                 - [-   -    -]
> #14  Background short  Completed                   -    1377                 - [-   -    -]
> #15  Background short  Completed                   -    1377                 - [-   -    -]
> #16  Background short  Completed                   -    1377                 - [-   -    -]
> #17  Background short  Completed                   -    1377                 - [-   -    -]
> #18  Background short  Completed                   -    1377                 - [-   -    -]
> #19  Background short  Completed                   -    1377                 - [-   -    -]
> #20  Background short  Completed                   -    1377                 - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
>
> I also have this from ten minutes later with a newer version of smartctl:
>
> root@typhoon-base-unit0:/root> ./smartctl.eric -x /dev/sda
> smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WD WD9001BKHG-02D22 Version: SR03
> Serial number: WX21EB1ANU78
> Device type: disk
> Transport protocol: SAS
> Local Time is: Fri Nov 23 00:45:08 2012 HKT
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature: 39 C
> Drive Trip Temperature: 69 C
> Manufactured in week 01 of year 2010
> Specified cycle count over device lifetime: 1048576
> Accumulated start-stop cycles: 26
> Specified load-unload count over device lifetime: 1114112
> Accumulated load-unload cycles: 0
> Elements in grown defect list: 0
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:      21189        2         2     21191          2       4950.446           0
> write:        89        4         0        93          4       1317.939           0
> verify:      103        0         0       103          0          0.000           0
>
> Non-medium error count: 169436
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background short  Completed                   -    1378                 - [-   -    -]
> # 2  Background short  Completed                   -    1378                 - [-   -    -]
> # 3  Background short  Completed                   -    1378                 - [-   -    -]
> # 4  Background short  Completed                   -    1377                 - [-   -    -]
> # 5  Background short  Completed                   -    1377                 - [-   -    -]
> # 6  Background short  Completed                   -    1377                 - [-   -    -]
> # 7  Background short  Completed                   -    1377                 - [-   -    -]
> # 8  Background short  Completed                   -    1377                 - [-   -    -]
> # 9  Background short  Completed                   -    1377                 - [-   -    -]
> #10  Background short  Completed                   -    1377                 - [-   -    -]
> #11  Background short  Completed                   -    1377                 - [-   -    -]
> #12  Background short  Completed                   -    1377                 - [-   -    -]
> #13  Background short  Completed                   -    1377                 - [-   -    -]
> #14  Background short  Completed                   -    1377                 - [-   -    -]
> #15  Background short  Completed                   -    1377                 - [-   -    -]
> #16  Background short  Completed                   -    1377                 - [-   -    -]
> #17  Background short  Completed                   -    1377                 - [-   -    -]
> #18  Background short  Completed                   -    1377                 - [-   -    -]
> #19  Background short  Completed                   -    1377                 - [-   -    -]
> #20  Background short  Completed                   -    1377                 - [-   -    -]
>
> Long (extended) Self Test duration: 6362 seconds [106.0 minutes]
>
> Background scan results log
>   Status: no scans active
>     Accumulated power on time, hours:minutes 1378:08 [82688 minutes]
>     Number of background scans performed: 0, scan progress: 0.00%
>     Number of background medium scans performed: 0
> Protocol Specific port log page for SAS SSP
> relative target port id = 1
>   generation code = 0
>   number of phys = 1
>   phy identifier = 0
>     attached device type: end device
>     attached reason: unknown
>     reason: unknown
>     negotiated logical link rate: phy enabled; 3 Gbps
>     attached initiator port: ssp=1 stp=1 smp=1
>     attached target port: ssp=0 stp=0 smp=0
>     SAS address = 0x50014ee3556977a6
>     attached SAS address = 0x5fcfffff00000001
>     attached phy identifier = 0
>     Invalid DWORD count = 0
>     Running disparity error count = 0
>     Loss of DWORD synchronization = 3
>     Phy reset problem = 0
>     Phy event descriptors:
>      Transmitted SSP frame error count: 0
>      Received SSP frame error count: 0
> relative target port id = 2
>   generation code = 0
>   number of phys = 1
>   phy identifier = 1
>     attached device type: no device attached
>     attached reason: unknown
>     reason: unknown
>     negotiated logical link rate: phy enabled; unknown
>     attached initiator port: ssp=0 stp=0 smp=0
>     attached target port: ssp=0 stp=0 smp=0
>     SAS address = 0x50014ee3556977a7
>     attached SAS address = 0x0
>     attached phy identifier = 0
>     Invalid DWORD count = 0
>     Running disparity error count = 0
>     Loss of DWORD synchronization = 0
>     Phy reset problem = 0
>     Phy event descriptors:
>      Transmitted SSP frame error count: 0
>      Received SSP frame error count: 0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html