Hello List
I have a RH 9 system (vanilla 2.4.27 kernel) running a Dell SCSI disk array
(JBOD) with a number of software RAID5 devices. Lately I've been having a
lot of problems with disks dropping out of the RAID devices. I've replaced
these disks the usual way (echo "scsi remove-single-device ..." and
"scsi add-single-device ..." into /proc/scsi/scsi, plus a
raidhotremove/raidhotadd). The replacement devices get added OK and the
RAID recovery completes, but the next morning the RAIDs have failed
devices again... I've done this a few times during the week and doubt that
all of these disks are bad...
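For reference, the swap sequence I've been using looks roughly like this
(the host/channel/id/lun numbers and device names are placeholders for the
failed disk's real values, and the commands are printed rather than
executed so the sketch is safe to run anywhere):

```shell
#!/bin/bash
# Print the hot-swap sequence for one failed array member.
# H/C/I/L (host/channel/id/lun) and /dev/sdq1 are placeholder values --
# substitute the ones for the disk actually being replaced.
H=0 C=0 I=2 L=0

show_swap_sequence() {
    cat <<EOF
raidhotremove /dev/md4 /dev/sdq1
echo "scsi remove-single-device $H $C $I $L" > /proc/scsi/scsi
# (swap the physical disk here)
echo "scsi add-single-device $H $C $I $L" > /proc/scsi/scsi
raidhotadd /dev/md4 /dev/sdq1
EOF
}

show_swap_sequence
```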
Today, the system is complaining about sdq. The kernel log is full of
messages like:
Jun 14 09:02:50 hal kernel: md: updating md4 RAID superblock on device
Jun 14 09:02:50 hal kernel: md: sdq1 [events: 0000002f]<6>(write) sdq1's sb offset: 430108096
Jun 14 09:02:50 hal kernel: I/O error: dev 41:01, sector 860216192
Jun 14 09:02:50 hal kernel: md: write_disk_sb failed for device sdq1
Jun 14 09:02:50 hal kernel: md: sdm1 [events: 0000002f]<6>(write) sdm1's sb offset: 143371968
Jun 14 09:02:50 hal kernel: md: sdl1 [events: 0000002f]<6>(write) sdl1's sb offset: 143371968
Jun 14 09:02:50 hal kernel: md: sdk1 [events: 0000002f]<6>(write) sdk1's sb offset: 143371968
Jun 14 09:02:50 hal kernel: md: errors occurred during superblock update, repeating
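Incidentally, the "dev 41:01" in the I/O error line is the hex block
major:minor pair, which decodes back to sdq1. A minimal sketch of the
decoding, assuming the standard Linux SCSI disk major allocation (major 8
covers disks 0-15, 65-71 cover 16-127, 128-135 cover 128-255, with 16
minors per disk):

```shell
#!/bin/bash
# Decode a 2.4 kernel "dev MM:mm" (hex major:minor) into an sd name.
# Assumes the standard SCSI disk block majors: 8 -> disks 0-15,
# 65-71 -> disks 16-127, 128-135 -> disks 128-255 (16 minors per disk).
decode_sd() {
    maj=$(printf '%d' "0x$1"); min=$(printf '%d' "0x$2")
    if [ "$maj" -eq 8 ]; then grp=0
    elif [ "$maj" -ge 65 ] && [ "$maj" -le 71 ]; then grp=$((maj - 64))
    elif [ "$maj" -ge 128 ] && [ "$maj" -le 135 ]; then grp=$((maj - 120))
    else echo "not a SCSI disk major: $maj" >&2; return 1; fi
    disk=$((grp * 16 + min / 16)); part=$((min % 16))
    letters=abcdefghijklmnopqrstuvwxyz
    if [ "$disk" -lt 26 ]; then
        # single-letter names: sda .. sdz
        name=sd$(echo "$letters" | cut -c$((disk + 1)))
    else
        # two-letter names: sdaa onward
        name=sd$(echo "$letters" | cut -c$((disk / 26)))$(echo "$letters" | cut -c$((disk % 26 + 1)))
    fi
    [ "$part" -gt 0 ] && name=$name$part
    echo "$name"
}

decode_sd 41 01   # the log's "dev 41:01" -> prints sdq1
```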
The system is cranky about sdq:
[root@hal dev]# mdadm --examine /dev/sdq1
mdadm: Cannot read superblock on /dev/sdq1
However, the RAID device containing sdq is still up and running (for now,
anyway):
md4 : active raid5 sdq1[0] sdm1[3] sdl1[2] sdk1[1]
430115904 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
All signs point to sdq being a dead drive, but this has happened 2 or 3
times this week with a few fairly new drives, so I don't think they're all
bad. Can anyone shed any light on why this might be happening? This system
has been running for about 18 months and hasn't had a lot of problems -
maybe 2 or 3 drives have quit in that time. The stupid Dell disk array
sometimes kicks out disks that aren't actually bad, so it may be at fault
in this case... Can anyone offer ideas on how to troubleshoot this?
Related to troubleshooting: over the past few days, SCSI devices have been
removed and added so often that the SCSI disk lettering is all jumbled
around... Now, when I see sdX fail in the log, I can no longer figure out
which physical disk or which SCSI ID that sdX corresponds to. Originally
I'd just count down the disk array until I arrived at the right letter,
since the SCSI IDs are marked on the array. Now I can't do that, as the
letters are shuffled around a little. Is there a trick or technique to
determine which SCSI ID a given sdX corresponds to?
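The closest thing I've found to a mapping: the /dev/sg* nodes keep
following the probe order shown in /proc/scsi/scsi, so pairing that list
with the output of sg_map (from sg3_utils, if it's installed) recovers the
host/channel/id/lun for each /dev/sdX. A sketch that numbers the
/proc/scsi/scsi entries in probe order, run here against a captured sample
(the Seagate entries are illustrative) rather than the live /proc file:

```shell
#!/bin/bash
# Number the SCSI devices in probe order from /proc/scsi/scsi.  At boot,
# sd letters are handed out in this same order, but hot remove/add re-uses
# freed letters, so the correspondence drifts over time.  The /dev/sg*
# numbering does keep tracking probe order, so combining this list with
# "sg_map" output (which pairs /dev/sg* with /dev/sd*) gives sdX -> ID.
# Shown against a captured sample so the sketch runs anywhere; on the live
# box, replace "$sample" with the contents of /proc/scsi/scsi.
sample='Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST336607LC  Rev: 0007
  Type:   Direct-Access               ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST336607LC  Rev: 0007
  Type:   Direct-Access               ANSI SCSI revision: 03'

# each numbered line is one sg device, in probe order
echo "$sample" | grep '^Host:' | nl
```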
Thanks in advance,
Mark
Mark Cuss, B. Sc.
Real Time Systems Analyst
System Administrator
CDL Systems Ltd
Suite 230
3553 - 31 Street NW
Calgary, AB, Canada
Phone: 403 289 1733 ext 226
Fax: 403 289 3967
www.cdlsystems.com