Drives disappearing from /dev/ during surface scan

Hello all,

I'm wondering if anyone could share some insight into a problem I'm having.

The problem is that every week, one or two hard drives simply disappear (from /dev/) during the weekly Wednesday morning long surface scan triggered by smartctl. The scan starts at 6 am; the drives dropped at 6:30 am one week and at 8:30 am the next (I think a surface scan takes about 3 hours).

Messages in syslog are similar to this:

Jun 23 08:27:58 Ukyo kernel: ata3: hard resetting link
Jun 23 08:27:59 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 23 08:28:04 Ukyo kernel: ata3: hard resetting link
Jun 23 08:28:04 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 23 08:28:09 Ukyo kernel: ata3: hard resetting link
Jun 23 08:28:09 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 23 08:28:09 Ukyo kernel: ata3.00: disabled
Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
Jun 23 08:28:09 Ukyo kernel: Descriptor sense data with sense descriptors (in hex):
Jun 23 08:28:09 Ukyo kernel: 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 23 08:28:09 Ukyo kernel:         0f ff ff ff
Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Add. Sense: Scsi parity error
Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Jun 23 08:28:09 Ukyo kernel: md: super_written gets error=-5, uptodate=0
Jun 23 08:28:09 Ukyo kernel: ata3: EH complete
Jun 23 08:28:09 Ukyo kernel: ata3.00: detaching (SCSI 3:0:0:0)
Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Stopping disk
Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] START_STOP FAILED
Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
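
To see which ports drop and how often, messages like the ones above can be tallied per ATA port. This is only a minimal sketch: the function name count_link_down is made up for illustration, and the log path in the comment is an assumption that depends on the syslog configuration.

```shell
# Count "SATA link down" events per ATA port in kernel log text read
# from stdin. Prints one line per port, for example "ata3 3".
# count_link_down is a hypothetical helper name, not an existing tool.
count_link_down() {
    awk '/SATA link down/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+:$/) {
                     p = $i
                     sub(/:$/, "", p)   # strip the trailing colon
                     n[p]++
                 }
         }
         END { for (p in n) print p, n[p] }' | sort
}

# Assumed log location; adjust to wherever kernel messages are written:
# count_link_down < /var/log/syslog
```

Comparing the counts and timestamps against the scan start time would show whether the same port fails every week or whether the failures move between drives.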

Trying to get the drive back by using:

echo "- - -" > /sys/class/scsi_host/hostX/scan

has no effect (in the log, ata3 gets rescanned but no drive is found):
Jun 24 16:36:21 Ukyo kernel: ata3: hard resetting link
Jun 24 16:36:22 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 24 16:36:22 Ukyo kernel: ata3: EH complete
Jun 24 16:36:31 Ukyo kernel: ata4: hard resetting link
Jun 24 16:36:32 Ukyo kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 24 16:36:32 Ukyo kernel: ata4.00: configured for UDMA/133
Jun 24 16:36:32 Ukyo kernel: ata4: EH complete
Jun 24 16:36:32 Ukyo kernel: sd 4:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Jun 24 16:36:32 Ukyo kernel: sd 4:0:0:0: [sdd] Write Protect is off
Jun 24 16:36:32 Ukyo kernel: sd 4:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or

Rebooting the system returns /dev/sdc (ata3) to working order. Re-adding it to the array results in a short repair, and everything is good again for another week.
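
For completeness, the rescan attempt can be applied to every SCSI host at once instead of a single hostX. The following is a minimal sketch: the function name and its optional directory argument are hypothetical (the argument exists only so the loop can be tried against a scratch directory), and on a live system it needs root.

```shell
# Ask every SCSI host under the given sysfs directory to rescan its bus.
# rescan_scsi_hosts and its optional argument are hypothetical; the
# default is the real sysfs location.
rescan_scsi_hosts() {
    base=${1:-/sys/class/scsi_host}
    for host in "$base"/host*; do
        [ -e "$host/scan" ] || continue
        # "- - -" is the wildcard for channel, target and LUN.
        echo "- - -" > "$host/scan"
    done
}
```

As the log above shows, a rescan cannot bring a drive back while the port still reports "SATA link down", so this is a convenience and not a fix.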

This only happens during the surface scan, and it has been hard to reproduce with just regular server use (copying, array rebuilding, etc.).

I suspect it may be a power issue, so I'm supplying some more numbers. The PSU is rated for 460 W in total: 165 W on the 5 V rail and 312 W on the 12 V rail. There are 10 drives in there, all recent models (1 TB+). System temperature is normal (four 12 cm fans installed, not counting the PSU one).
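
As a rough check on those ratings, the rail wattages can be converted to current and divided evenly over the 10 drives. This is only a sketch of the arithmetic: the function names are made up, and the per-drive figure is an upper bound, since the motherboard, CPU and fans draw from the same rails. Actual drive consumption is not given here and would have to come from the drive datasheets.

```shell
# Convert a rail's rated power (watts) and voltage into amps.
# rail_amps and per_drive_amps are hypothetical helper names.
rail_amps() {
    awk -v w="$1" -v v="$2" 'BEGIN { printf "%.1f\n", w / v }'
}

# Even share per drive, ignoring every other load on the rail.
per_drive_amps() {
    awk -v w="$1" -v v="$2" -v n="$3" 'BEGIN { printf "%.2f\n", w / v / n }'
}

rail_amps 312 12           # 12 V rail: prints 26.0
rail_amps 165 5            # 5 V rail:  prints 33.0
per_drive_amps 312 12 10   # prints 2.60
per_drive_amps 165 5 10    # prints 3.30
```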

However, I'm somewhat skeptical about the power issue. I used to have another server that would hit its power limiter during a cold start (i.e., it would not power on because the spin-up cycle caused an overload). That server would still power up fully when forced to start by simply powering it on 2 or 3 times quickly in a row, and it would run stable for months once it managed to spin up all drives.

Any insight into why this might occur is appreciated. I'm currently considering spreading the weekly surface scans out a bit to prevent this, but would rather find out what the real issue is. Other things I'm considering are replacing two drives with one drive (2x 1 TB -> 2 TB) to reduce the power load a bit, or finding a PSU that is rated a bit higher.
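
Spreading the scans out could look something like the sketch below, which prints smartd.conf lines that give each drive its long self-test on a different weekday. It assumes the tests are scheduled through smartd's -s directive (the setup described here may instead call smartctl from cron), and the function name and device names are placeholders.

```shell
# Print one smartd.conf line per device, assigning long self-tests to
# successive weekdays (1 = Monday ... 7 = Sunday) at the given hour.
# stagger_long_tests is a hypothetical helper; pass the hour without a
# leading zero (6, not 06) so printf does not read it as octal.
stagger_long_tests() {
    hour=$1
    shift
    day=1
    for dev in "$@"; do
        printf '%s -a -s L/../../%d/%02d\n' "$dev" "$day" "$hour"
        day=$(( day % 7 + 1 ))   # wrap back to Monday after Sunday
    done
}

# Placeholder device names:
stagger_long_tests 6 /dev/sda /dev/sdb /dev/sdc
```

With 10 drives and 7 weekdays, some days would still carry two scans, so this reduces the simultaneous load but does not eliminate it.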

--John




