Scripts watch SW RAID1 for HD failure

Summary:  What the log files look like when a hard drive fails.
          Scripts to email a problem report when /proc/mdstat looks bad.


I have been running software RAID1 on Red Hat 7.1 since June last year. 
All was well until recently, when there were unrecoverable disk errors.  
It turned out one drive had failed in October and silly me had not noticed! 

  (By the way, both drives were IBM DTLA 75GXP drives - 20 gigs - 
   and I suggest anyone with this series of drives back up their data 
   and take these lemons back!  I have had three out of three fail, 
   and the drives are notorious, as Googling "DTLA failure" shows.
   There is even a class action suit about them.)


I had not known what sort of errors to look for if a drive failed.  There 
are two ways of checking.   

One is to keep an eye on the results of:

    cat /proc/mdstat


The other approach is to look in /var/log/messages for specific messages.  
It wasn't clear beforehand what these might be, but here are some sets of 
lines I found.  A reliable way of searching 7 months of messages for hard 
drive trouble turned out to be searching for: "raid1: h".
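For instance, something like:

    grep "raid1: h" /var/log/messages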

The first set of lines below is from one of the two drives failing while 
the system carried on working fine.  This was during normal operation, not 
at boot time: hdc5 failed and md2 was left reliant on hda5 alone.

= = = = = = = = = = = 

Oct 21 00:10:43 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 21 00:10:43 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=230336, sector=182056
Oct 21 00:10:43 mybox kernel: end_request: I/O error, dev 16:05 (hdc), sector 182056
Oct 21 00:10:43 mybox kernel: raid1: Disk failure on hdc5, disabling device. 
Oct 21 00:10:43 mybox kernel: ^IOperation continuing on 1 devices
Oct 21 00:10:43 mybox kernel: raid1: hdc5: rescheduling block 182056
Oct 21 00:10:43 mybox kernel: md: recovery thread got woken up ...
Oct 21 00:10:43 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Oct 21 00:10:43 mybox kernel: md: recovery thread finished ...
Oct 21 00:10:43 mybox kernel: dirty sb detected, updating.
Oct 21 00:10:43 mybox kernel: md: updating md2 RAID superblock on device
Oct 21 00:10:43 mybox kernel: hda5 [events: 0000006f](write) hda5's sb offset: 3172736
Oct 21 00:10:43 mybox kernel: (skipping faulty hdc5 )
Oct 21 00:10:53 mybox kernel: .
Oct 21 00:10:53 mybox kernel: raid1: hda5: redirecting sector 182056 to another mirror
Oct 21 00:10:53 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 21 00:10:53 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=230336, sector=182064
Oct 21 00:10:53 mybox kernel: end_request: I/O error, dev 16:05 (hdc), sector 182064
Oct 21 00:10:53 mybox kernel: raid1: hdc5: rescheduling block 182064
Oct 21 00:10:53 mybox kernel: md: recovery thread got woken up ...
Oct 21 00:10:53 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Oct 21 00:10:53 mybox kernel: md: recovery thread finished ...
Oct 21 00:10:53 mybox kernel: raid1: hda5: redirecting sector 182064 to another mirror
Oct 21 00:10:53 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 21 00:10:54 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=230336, sector=182072
Oct 21 00:10:54 mybox kernel: end_request: I/O error, dev 16:05 (hdc), sector 182072
Oct 21 00:10:54 mybox kernel: raid1: hdc5: rescheduling block 182072
Oct 21 00:10:54 mybox kernel: md: recovery thread got woken up ...
Oct 21 00:10:54 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Oct 21 00:10:54 mybox kernel: md: recovery thread finished ...
Oct 21 00:10:54 mybox kernel: raid1: hda5: redirecting sector 182072 to another mirror


= = = = = = = = = = = 



Here are the error reports from when another partition on the same drive, 
hdc6, failed, taking md3 down to one drive too: hda6.

= = = = = = = = = = = 

Jan 18 04:03:31 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 18 04:03:31 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=8757269, sector=2363336
Jan 18 04:03:31 mybox kernel: end_request: I/O error, dev 16:06 (hdc), sector 2363336
Jan 18 04:03:31 mybox kernel: raid1: Disk failure on hdc6, disabling device. 
Jan 18 04:03:31 mybox kernel: ^IOperation continuing on 1 devices
Jan 18 04:03:31 mybox kernel: raid1: hdc6: rescheduling block 2363336
Jan 18 04:03:31 mybox kernel: md: recovery thread got woken up ...
Jan 18 04:03:31 mybox kernel: md3: no spare disk to reconstruct array! -- continuing in degraded mode
Jan 18 04:03:31 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jan 18 04:03:31 mybox kernel: md: recovery thread finished ...
Jan 18 04:03:31 mybox kernel: raid1: dirty sb detected, updating.
Jan 18 04:03:31 mybox kernel: md: updating md3 RAID superblock on device
Jan 18 04:03:31 mybox kernel: md: hda6 [events: 00000075](write) hda6's sb offset: 2104384
Jan 18 04:03:31 mybox kernel: md: (skipping faulty hdc6 )
Jan 18 04:03:31 mybox kernel: raid1: hda6: redirecting sector 2363336 to another mirror

= = = = = = = = = = = 

Here are three similar sets of error messages from when there were errors 
on hda5 which could not be recovered from - with hdc5 already out of the 
array, there was no other mirror left to read from.  This is when I noticed 
things going wrong: a file could not be written because a directory had 
become unusable due to a corrupted file.

= = = = = = = = = = = 

Jan 24 04:49:29 gair kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 24 04:49:29 gair kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=2931882, sector=2883624
Jan 24 04:49:29 gair kernel: end_request: I/O error, dev 03:05 (hda), sector 2883624
Jan 24 04:49:29 gair kernel: raid1: hda5: rescheduling block 2883624
Jan 24 04:49:29 gair kernel: raid1: hda5: unrecoverable I/O read error for block 2883624
Jan 24 04:49:29 gair kernel: EXT2-fs error (device md(9,2)): ext2_read_inode: unable to read inode block - inode=174643, block=360453

Jan 24 04:49:35 gair kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 24 04:49:35 gair kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=2931882, sector=2883624
Jan 24 04:49:35 gair kernel: end_request: I/O error, dev 03:05 (hda), sector 2883624
Jan 24 04:49:35 gair kernel: raid1: hda5: rescheduling block 2883624
Jan 24 04:49:35 gair kernel: raid1: hda5: unrecoverable I/O read error for block 2883624
Jan 24 04:49:35 gair kernel: EXT2-fs error (device md(9,2)): ext2_read_inode: unable to read inode block - inode=174651, block=360453

Jan 24 04:49:53 gair kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 24 04:49:53 gair kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=2931882, sector=2883624
Jan 24 04:49:53 gair kernel: end_request: I/O error, dev 03:05 (hda), sector 2883624
Jan 24 04:49:53 gair kernel: raid1: hda5: rescheduling block 2883624
Jan 24 04:49:53 gair kernel: raid1: hda5: unrecoverable I/O read error for block 2883624
Jan 24 04:49:53 gair kernel: EXT2-fs error (device md(9,2)): ext2_read_inode: unable to read inode block - inode=174630, block=360453


= = = = = = = = = = = 


I considered a log-watching approach: keeping an eye on /var/log/messages
all the time for suspicious lines - for instance searching for "reconstruct",
"SeekComplete Error" etc.  Probably looking for "raid1: h" would be fine.

If I were going to do this, I would write a Perl script which uses the
File::Tail CPAN module:

   http://search.cpan.org/doc/MGRABNAR/File-Tail-0.98/Tail.pm
   http://www.cpan.org/modules/00modlist.long.html

File::Tail needs Time::HiRes too:

   http://search.cpan.org/search?module=Time::HiRes
   http://search.cpan.org/doc/JHI/perl-5.7.2/ext/Time/HiRes/HiRes.pm

Time::HiRes is also packaged as an RPM (perl-Time-HiRes), complete with 
its compiled library:

   http://rpmfind.net/linux/rpm2html/search.php?query=perl-Time-HiRes

File::Tail can keep watching and pattern-matching the lines of a growing 
log file, and can apparently cope gracefully with the file disappearing 
during log rotation, for instance.

But I haven't done this yet.  I would need to figure out a way of making 
the script take only one action in response to a dozen or so log-file 
lines.  I am sure it could be done - a rough sketch is below - but for 
now I have adopted a simpler approach.
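
If I did get around to it, something along these lines might work.  This 
is only an untested sketch: it assumes File::Tail is installed, reuses the 
raid-mail-problem.sh script from further below, and crudely solves the 
one-action-per-burst-of-log-lines problem by ignoring further matches for 
an hour after sending a report.

= = = = = = = = = = = 

#!/usr/bin/perl -w
#
# Untested sketch: watch /var/log/messages for RAID1 trouble and mail 
# a report at most once an hour.

use strict;
use File::Tail;

my $logfile   = "/var/log/messages";
my $min_gap   = 3600;      # Seconds between reports - at most one an hour.
my $last_sent = 0;

# File::Tail keeps returning new lines as they are appended, and copes 
# with the file being replaced during log rotation.
my $tail = File::Tail->new(name => $logfile, maxinterval => 60, interval => 10);

while (defined(my $line = $tail->read)) {

    # The same search string as above.
    next unless $line =~ /raid1: h/;

    # Take only one action for a dozen or so matching lines: ignore 
    # further matches until $min_gap seconds have passed.
    if (time() - $last_sent >= $min_gap) {
        system("/root/raid-mail-problem.sh");
        $last_sent = time();
    }
}

= = = = = = = = = = = 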


Hourly cron job to check /proc/mdstat
=====================================

Here are two self-explanatory shell scripts which live in /root/.  The first
one is called as an hourly cron job.  The second is only called if an error
is found.  So I would expect this system to report a hard drive failure within
an hour of it occurring, and then every hour until it was fixed or the script
was disabled or doctored.
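
For anyone copying this setup, the first script needs an hourly entry in 
root's crontab ("crontab -e" as root) - something like the following, where 
the exact minute does not matter:

    0 * * * *    /root/raid-hourly-test.sh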

= = = = = = = = = = = 


raid-hourly-test.sh
-------------------


#!/bin/bash

# To be run every hour, to check the state of "cat /proc/mdstat".
#
# If there is a problem with any of the HD partitions, generate an 
# email to me.
# 
# The healthy state of it is:
#
# -----------------------------------------------------------
#
# Personalities : [raid1] 
# read_ahead 1024 sectors
# md5 : active raid1 hda9[0] hdc9[1]
#       10080384 blocks [2/2] [UU]
#       
# md2 : active raid1 hda8[0] hdc8[1]
#       1048704 blocks [2/2] [UU]
#       
# md3 : active raid1 hda7[0] hdc7[1]
#       524544 blocks [2/2] [UU]
#       
# md0 : active raid1 hda6[0] hdc6[1]
#       4194688 blocks [2/2] [UU]
#       
# md1 : active raid1 hda5[0] hdc5[1]
#       4194688 blocks [2/2] [UU]
#       
# md4 : active raid1 hda1[0] hdc1[1]
#       20544 blocks [2/2] [UU]
#       
# unused devices: <none>
#
# -----------------------------------------------------------
#
# An unhealthy one would have lines like:
#
#       10080384 blocks [2/1] [_U]  I have observed.
#       10080384 blocks [2/1] [U_]  I guess.
#       10080384 blocks [2/0] [__]  I guess.

# For debugging, comment out the next line and doctor the file to be bad.

cat /proc/mdstat > /root/mdstat-for-test.txt

# I want to look for any occurrence of an underscore, except for the one 
# in "read_ahead".
#
# One way is to look for the word "block" and then for any underscore 
# which follows, but I can't see how to do that with grep. 
#
# Instead, looking for any character not "d" followed by an underscore.
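#
# (A pattern like    grep 'blocks.*_'    would probably also work, since 
#  in the mdstat output the underscores only ever appear on the same 
#  lines as the word "blocks".)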

if grep '[^d]_' /root/mdstat-for-test.txt > /dev/null; then

   echo 'Mailing RAID hard disk error report!!!'
   /root/raid-mail-problem.sh
fi




= = = = = = = = = = = 


raid-mail-problem.sh
--------------------


#!/bin/bash
#
# Generate an email to me reporting a problem with hard drives 
# in the RAID system.
#
# Include the current /proc/mdstat and some recent lines from 
# /var/log/messages.  Assemble the email in a file.

echo To: xx@yyyy.zzz           > /root/mdstat-for-email.txt
echo Subject: !! Hard drive failure in gair RAID array !! >> /root/mdstat-for-email.txt
echo                          >> /root/mdstat-for-email.txt

cat /proc/mdstat              >> /root/mdstat-for-email.txt
echo                          >> /root/mdstat-for-email.txt

echo Some recent /var/log/messages lines: >> /root/mdstat-for-email.txt
echo                          >> /root/mdstat-for-email.txt
tail -n 50 /var/log/messages  >> /root/mdstat-for-email.txt

# sendmail's -t flag takes the recipient from the To: header above.  The 
# full path is given because cron's default PATH may not include /usr/sbin.

/usr/sbin/sendmail -t < /root/mdstat-for-email.txt

= = = = = = = = = = = 


Thanks to those who wrote the software RAID system!

I am running two IBM 60GXP drives in RAID1, am keen to install a third, 
and want a removable-drive slot so I can clone to a fourth drive for 
offsite backup.  That is another story, but the plan is to boot from 
another HD, so that the second and third RAID1 drives are not mounted or 
part of a running RAID system, and then to use dd to copy every bit of 
one of them to an identical type of drive in the slot.  That removable 
drive should then be able to boot up on its own in another machine.
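
The copy itself would be something like this - the device names here are 
only examples, and would depend on where the source and removable drives 
end up:

    dd if=/dev/hdc of=/dev/hdd bs=1024k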
 

  - Robin


//  Robin Whittle                   http://www.firstpr.com.au
//  Melbourne, Australia            http://fondlyandfirmly.com