Summary:  What the log files look like when a hard drive fails, and
          scripts to email a problem report when /proc/mdstat looks bad.

I have been running software RAID1 on Red Hat 7.1 since June last year.
All was well until recently, when there were unrecoverable disk errors.
It turned out that one drive had failed in October and silly me had not
noticed!

(By the way, both drives were IBM DTLA 75GXP drives - 20 gigs - and I
suggest anyone with this series of drives back up their data and take
these lemons back! I have had three out of three fail, and the drives
are notorious, as Googling "DTLA failure" shows. There is even a class
action suit about them.)

I had not known what sort of errors to look for if a drive failed.
There are two ways of checking. One is to keep an eye on the results
of:

   cat /proc/mdstat

The other approach is to look at /var/log/messages for specific things.
It wasn't clear what these might be, but here are some sets of lines I
found. A reliable way of searching 7 months of messages for hard drive
trouble was to search for: "raid1: h".
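For example, something like this (assuming the rotated copies of the
log have not been compressed) covers the current and older logs:

   grep "raid1: h" /var/log/messages*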
The first set of lines below shows one of the two drives failing while
the system carried on working fine. This happened during normal
operation, not at boot time: hdc5 failed and md2 was left reliant on
hda5 alone.

= = = = = = = = = = =

Oct 21 00:10:43 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 21 00:10:43 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=230336, sector=182056
Oct 21 00:10:43 mybox kernel: end_request: I/O error, dev 16:05 (hdc), sector 182056
Oct 21 00:10:43 mybox kernel: raid1: Disk failure on hdc5, disabling device.
Oct 21 00:10:43 mybox kernel: ^IOperation continuing on 1 devices
Oct 21 00:10:43 mybox kernel: raid1: hdc5: rescheduling block 182056
Oct 21 00:10:43 mybox kernel: md: recovery thread got woken up ...
Oct 21 00:10:43 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Oct 21 00:10:43 mybox kernel: md: recovery thread finished ...
Oct 21 00:10:43 mybox kernel: dirty sb detected, updating.
Oct 21 00:10:43 mybox kernel: md: updating md2 RAID superblock on device
Oct 21 00:10:43 mybox kernel: hda5 [events: 0000006f](write) hda5's sb offset: 3172736
Oct 21 00:10:43 mybox kernel: (skipping faulty hdc5 )
Oct 21 00:10:53 mybox kernel: .
Oct 21 00:10:53 mybox kernel: raid1: hda5: redirecting sector 182056 to another mirror
Oct 21 00:10:53 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 21 00:10:53 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=230336, sector=182064
Oct 21 00:10:53 mybox kernel: end_request: I/O error, dev 16:05 (hdc), sector 182064
Oct 21 00:10:53 mybox kernel: raid1: hdc5: rescheduling block 182064
Oct 21 00:10:53 mybox kernel: md: recovery thread got woken up ...
Oct 21 00:10:53 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Oct 21 00:10:53 mybox kernel: md: recovery thread finished ...
Oct 21 00:10:53 mybox kernel: raid1: hda5: redirecting sector 182064 to another mirror
Oct 21 00:10:53 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 21 00:10:54 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=230336, sector=182072
Oct 21 00:10:54 mybox kernel: end_request: I/O error, dev 16:05 (hdc), sector 182072
Oct 21 00:10:54 mybox kernel: raid1: hdc5: rescheduling block 182072
Oct 21 00:10:54 mybox kernel: md: recovery thread got woken up ...
Oct 21 00:10:54 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Oct 21 00:10:54 mybox kernel: md: recovery thread finished ...
Oct 21 00:10:54 mybox kernel: raid1: hda5: redirecting sector 182072 to another mirror

= = = = = = = = = = =

Here are the error reports from when another partition on the same
drive, hdc6, failed, taking md3 down to one drive too: hda6.

= = = = = = = = = = =

Jan 18 04:03:31 mybox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 18 04:03:31 mybox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=8757269, sector=2363336
Jan 18 04:03:31 mybox kernel: end_request: I/O error, dev 16:06 (hdc), sector 2363336
Jan 18 04:03:31 mybox kernel: raid1: Disk failure on hdc6, disabling device.
Jan 18 04:03:31 mybox kernel: ^IOperation continuing on 1 devices
Jan 18 04:03:31 mybox kernel: raid1: hdc6: rescheduling block 2363336
Jan 18 04:03:31 mybox kernel: md: recovery thread got woken up ...
Jan 18 04:03:31 mybox kernel: md3: no spare disk to reconstruct array! -- continuing in degraded mode
Jan 18 04:03:31 mybox kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jan 18 04:03:31 mybox kernel: md: recovery thread finished ...
Jan 18 04:03:31 mybox kernel: raid1: dirty sb detected, updating.
Jan 18 04:03:31 mybox kernel: md: updating md3 RAID superblock on device
Jan 18 04:03:31 mybox kernel: md: hda6 [events: 00000075](write) hda6's sb offset: 2104384
Jan 18 04:03:31 mybox kernel: md: (skipping faulty hdc6 )
Jan 18 04:03:31 mybox kernel: raid1: hda6: redirecting sector 2363336 to another mirror

= = = = = = = = = = =

Here are three similar sets of error messages from when there were
errors on hda5 which could not be recovered from. This is when I
noticed things going wrong - a file could not be written because its
directory was unusable, due to a corrupted file.

= = = = = = = = = = =

Jan 24 04:49:29 gair kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 24 04:49:29 gair kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=2931882, sector=2883624
Jan 24 04:49:29 gair kernel: end_request: I/O error, dev 03:05 (hda), sector 2883624
Jan 24 04:49:29 gair kernel: raid1: hda5: rescheduling block 2883624
Jan 24 04:49:29 gair kernel: raid1: hda5: unrecoverable I/O read error for block 2883624
Jan 24 04:49:29 gair kernel: EXT2-fs error (device md(9,2)): ext2_read_inode: unable to read inode block - inode=174643, block=360453
Jan 24 04:49:35 gair kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 24 04:49:35 gair kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=2931882, sector=2883624
Jan 24 04:49:35 gair kernel: end_request: I/O error, dev 03:05 (hda), sector 2883624
Jan 24 04:49:35 gair kernel: raid1: hda5: rescheduling block 2883624
Jan 24 04:49:35 gair kernel: raid1: hda5: unrecoverable I/O read error for block 2883624
Jan 24 04:49:35 gair kernel: EXT2-fs error (device md(9,2)): ext2_read_inode: unable to read inode block - inode=174651, block=360453
Jan 24 04:49:53 gair kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 24 04:49:53 gair kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=2931882, sector=2883624
Jan 24 04:49:53 gair kernel: end_request: I/O error, dev 03:05 (hda), sector 2883624
Jan 24 04:49:53 gair kernel: raid1: hda5: rescheduling block 2883624
Jan 24 04:49:53 gair kernel: raid1: hda5: unrecoverable I/O read error for block 2883624
Jan 24 04:49:53 gair kernel: EXT2-fs error (device md(9,2)): ext2_read_inode: unable to read inode block - inode=174630, block=360453

= = = = = = = = = = =

I considered using a log-watching script to keep an eye on
/var/log/messages all the time for suspicious lines - for instance,
searching for "reconstruct", "SeekComplete Error", etc. Probably
looking for "raid1: h" would be fine.

If I were going to do this, I would write a Perl script which uses the
File::Tail CPAN module:

   http://search.cpan.org/doc/MGRABNAR/File-Tail-0.98/Tail.pm
   http://www.cpan.org/modules/00modlist.long.html

File::Tail needs Time::HiRes too:

   http://search.cpan.org/search?module=Time::HiRes
   http://search.cpan.org/doc/JHI/perl-5.7.2/ext/Time/HiRes/HiRes.pm

Time::HiRes is also available as an RPM, which includes a compiled
library:

   http://rpmfind.net/linux/rpm2html/search.php?query=perl-Time-HiRes

File::Tail can keep watching and pattern matching the lines of a
growing log file, and can apparently cope gracefully with the file
disappearing during log rotation, for instance.
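A minimal sketch of such a script - untested, and based only on the
File::Tail documentation, not on anything I have actually run - might
look like this, reusing the raid-mail-problem.sh script shown further
down:

= = = = = = = = = = =

#!/usr/bin/perl -w
#
# Sketch: watch /var/log/messages for RAID1 trouble and mail a report.
# Needs the File::Tail and Time::HiRes modules.

use strict;
use File::Tail;

# tail => 0 starts at the current end of the file, ignoring old lines;
# maxinterval is the longest time in seconds File::Tail will sleep
# between checks of the file.
my $log = File::Tail->new(name        => "/var/log/messages",
                          maxinterval => 60,
                          tail        => 0);

while (defined(my $line = $log->read)) {
    if ($line =~ /raid1: h/) {
        # As written, this runs the mailer once for every matching
        # line, which is the problem mentioned below.
        system("/root/raid-mail-problem.sh");
    }
}

= = = = = = = = = = =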
But I haven't done this yet. I would need to figure out a way of making
the script take only one action in response to a dozen or so log-file
lines. I am sure it could be done, but for now, I have adopted a
simpler approach.


Hourly cron job to check /proc/mdstat
=====================================

Here are two self-explanatory shell scripts which live in /root/. The
first one is called as an hourly cron job. The second is only called if
an error is found. So I would expect this system to report a hard drive
failure within an hour of it occurring, and then every hour until it
was fixed, or the script disabled or doctored.
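The hourly call is just ordinary cron configuration - for example, a
root crontab entry along these lines, using the path the scripts live
in below:

   0 * * * * /root/raid-hourly-test.sh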
= = = = = = = = = = =

raid-hourly-test.sh
-------------------

#!/bin/bash

# To be run every hour, to check the state of "cat /proc/mdstat".
#
# If there is a problem with any of the HD partitions, generate an
# email to me.
#
# The healthy state of it is:
#
# -----------------------------------------------------------
#
# Personalities : [raid1]
# read_ahead 1024 sectors
# md5 : active raid1 hda9[0] hdc9[1]
#       10080384 blocks [2/2] [UU]
#
# md2 : active raid1 hda8[0] hdc8[1]
#       1048704 blocks [2/2] [UU]
#
# md3 : active raid1 hda7[0] hdc7[1]
#       524544 blocks [2/2] [UU]
#
# md0 : active raid1 hda6[0] hdc6[1]
#       4194688 blocks [2/2] [UU]
#
# md1 : active raid1 hda5[0] hdc5[1]
#       4194688 blocks [2/2] [UU]
#
# md4 : active raid1 hda1[0] hdc1[1]
#       20544 blocks [2/2] [UU]
#
# unused devices: <none>
#
# -----------------------------------------------------------
#
# An unhealthy one would have lines like:
#
#       10080384 blocks [2/1] [_U]      I have observed.
#       10080384 blocks [2/1] [U_]      I guess.
#       10080384 blocks [2/0] [__]      I guess.

# For debugging, comment out the next line and doctor the file to be bad.

cat /proc/mdstat > /root/mdstat-for-test.txt

# I want to look for any occurrence of an underscore, except for the
# one in "read_ahead".
#
# One way is to look for the word "block" and then for any underscore
# which follows, but I can't see how to do that with grep.
#
# Instead, look for any character which is not "d" followed by an
# underscore.

if grep '[^d]_' /root/mdstat-for-test.txt > /dev/null; then
    echo "Mailing RAID hard disk error report!!!"
    /root/raid-mail-problem.sh
fi

= = = = = = = = = = =

raid-mail-problem.sh
--------------------

#!/bin/bash
#
# Generate an email to me reporting a problem with hard drives
# in the RAID system.
#
# Include the current /proc/mdstat and some recent lines from
# /var/log/messages.

# Assemble the email in a file.

echo "To: xx@yyyy.zzz"                                      >  /root/mdstat-for-email.txt
echo "Subject: !! Hard drive failure in gair RAID array !!" >> /root/mdstat-for-email.txt
echo                                                        >> /root/mdstat-for-email.txt
cat /proc/mdstat                                            >> /root/mdstat-for-email.txt
echo                                                        >> /root/mdstat-for-email.txt
echo "Some recent /var/log/messages lines:"                 >> /root/mdstat-for-email.txt
echo                                                        >> /root/mdstat-for-email.txt
tail -n 50 /var/log/messages                                >> /root/mdstat-for-email.txt

# -t tells sendmail to take the recipient address from the To: header.
/usr/sbin/sendmail -t < /root/mdstat-for-email.txt

= = = = = = = = = = =

Thanks to those who wrote the software RAID system!

I am now running two IBM 60GXP drives in RAID1, am keen to install a
third, and have a removable-drive slot for cloning to a fourth drive
for offsite backup. That is another story, but the plan is to boot from
another HD, so that the second and third RAID1 drives are not mounted
or part of a running RAID system, and then use dd to copy every bit of
one of them to an identical type of drive in the removable slot. That
removable drive should be able to boot up on its own in another
machine.

  - Robin

Robin Whittle            http://www.firstpr.com.au
Melbourne, Australia     http://fondlyandfirmly.com

-
To unsubscribe from this list: send the line "unsubscribe linux-raid"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html