Re: [RAID] Scripts watch SW RAID1 for HD failure - mdctl

Robin Whittle <rw@firstpr.com.au> · Mon, 04 Feb 2002 23:25:36 +1100

Junaid Rizvi wrote:

> What about  mdctl --monitor ?

I hadn't heard of this.   A web search finds some references to it.

It is not mentioned in the two pieces of doco I am relying on:

   http://www.linuxdoc.org/FAQ/Linux-RAID-FAQ/

     Linux-RAID FAQ   Gregory Leblanc
     gleblanc (at) cu-portland.edu
     Revision v0.0.10    24 April 2001  Revised by: gml

  http://www.linuxdoc.org/HOWTO/Software-RAID-HOWTO.html

     The Software-RAID HOWTO
     Jakob Østergaard ( jakob@ostenfeld.dk)
     v. 0.90.7 19th of January 2000 

I don't have this program on my computer.  I am approaching this as a
user, installing Red Hat 7.2, rather than someone who is involved in
programming the RAID code or interested in its internals.

After searching this mailing list, I found that mdctl lives about 1000
km from here:

   http://www.cse.unsw.edu.au/~neilb/source/mdctl/

and has been discussed on this list since June last year.   I read some
messages on this list but do not follow all threads.  

     mdctl is a single program that can be used to control Linux md 
     devices. It is intended to provide all the functionality of the 
     mdtools and raidtools but with a very different interface.

     mdctl can perform all functions without a configuration file. 
     There is the option of using a configuration file, but not in 
     the same way that raidtools uses one.

     raidtools uses a configuration file to describe how to create 
     a RAID array, and also uses this file partially to start a 
     previously created RAID array.

     Further, raidtools requires the configuration file for such things
     as stopping a raid array, which needs to know nothing about the 
     array.

After downloading the source and looking at the man page, I could find
no "--monitor" option.  Looking further, I found something on a
"--monitor" option in ReadMe.c, which I think is a synonym for
"Follow":

 For follow/monitor:

   --mail=       -m   : Address to mail alerts of failure to
   --program=    -p   : Program to run when an event is detected
   --alert=           : same as --program
   --delay=      -d   : seconds of delay between polling state. 
                        default=60

Yes - it looks like mdctl can keep an eye on the RAID system and mail
reports and run programs when something goes wrong.   I will investigate
further.   
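
To illustrate the --program / --alert option, here is a minimal sketch
of a program that could be named there, written in Python.  It is not
part of mdctl.  I have not confirmed what arguments mdctl passes to the
program, so the sketch assumes nothing about them and simply writes
whatever it receives to the system log.  The names format_alert and
md-alert are made up for this example.

```python
import sys


def format_alert(args):
    """Build one log line from whatever arguments the monitor passed.

    The argument layout is NOT assumed here: the arguments are joined
    as-is, because the calling convention has not been confirmed.
    """
    detail = " ".join(args) if args else "(no arguments)"
    return "md alert: " + detail


def main(argv):
    # syslog is imported here so format_alert() can be tested on its own.
    import syslog

    syslog.openlog("md-alert")  # hypothetical ident for the log entries
    syslog.syslog(syslog.LOG_ERR, format_alert(argv[1:]))


if __name__ == "__main__":
    main(sys.argv)
```

Running the script by hand with a few made-up arguments is a cheap way
to confirm that the entry really turns up in the system log.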

I would like a way of ensuring the report system really works, without
actually having a RAID failure.  I suppose I could doctor the source to
achieve this.   As far as I can see from Monitor.c, mdctl does not yet
add anything to the system logs if there is a failure.

    Every few seconds, scan every md device looking for changes
    When a change is found, log it, possibly run the alert command,
    and possibly send Email

    For each array, we record:
       Update time
       active/working/failed/spare drives
       State of each device.

     If the update time changes, check out all the data again
     It is possible that we cannot get the state of each device
     due to bugs in the md kernel module.

     if active_drives decreases, generate a "Fail" event
     if active_drives increases, generate a "SpareActive" event

     if we detect an array with active<raid and spare==0
     we look at other arrays that have same spare-group
     If we find one with active==raid and spare>0,
     and if we can get_disk_info and find a name
     Then we hot-remove and hot-add to the other array

This last paragraph seems to replicate what I thought was the automatic
function of the existing RAID software - to add in a spare if necessary. 
I would want to be sure that there was no conflict.   This RAID stuff is
critical and hard to realistically test.   
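
The rules quoted above can be sketched in a few lines of Python.  This
is only a model of the description, not the code in Monitor.c.  The
ArrayState record and the function names are invented for the example,
and treating an empty spare-group as "no group" is my assumption.

```python
from dataclasses import dataclass


@dataclass
class ArrayState:
    name: str
    raid_disks: int        # number of devices the array should have
    active: int            # devices currently active
    spare: int             # spare devices attached to this array
    spare_group: str = ""  # empty string means no group (assumption)


def events(old, new):
    """Compare two snapshots of one array and name the events."""
    found = []
    if new.active < old.active:
        found.append("Fail")
    if new.active > old.active:
        found.append("SpareActive")
    return found


def find_spare_donor(degraded, arrays):
    """Return an array in the same spare-group that could give up a spare.

    Only applies when 'degraded' has active < raid_disks and no spare of
    its own.  A donor must be fully active and have at least one spare.
    """
    if not (degraded.active < degraded.raid_disks and degraded.spare == 0):
        return None
    if not degraded.spare_group:
        return None
    for other in arrays:
        if other is degraded:
            continue
        if (other.spare_group == degraded.spare_group
                and other.active == other.raid_disks
                and other.spare > 0):
            return other
    return None
```

The sketch stops at choosing a donor.  The actual hot-remove and
hot-add step is left out, since that is exactly the part where a
conflict with the kernel's own spare handling would need checking.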

When I first got RAID1 going on a RH6.0 installation, I tested it by
unplugging one of the drives whilst compiling the kernel.  There was a
flurry of error messages for quite a while, but the system kept running
perfectly, which greatly impressed me.  Rebooting with the drive plugged
in caused it to be automatically resynched - which was also impressive. 
That has been the extent of my testing. 
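
One way to exercise a failure check without pulling a drive is to run
the detection logic against sample text instead of the live system.
The sketch below is independent of mdctl.  It assumes the usual
/proc/mdstat layout, in which each array has a status such as [2/2]
(devices expected / devices active), and reports arrays where the
second number is lower than the first.  The function name and the
sample text are made up for the example.

```python
import re

ARRAY_LINE = re.compile(r"^(md\d+)\s*:")
STATUS = re.compile(r"\[(\d+)/(\d+)\]")  # [expected/active]

SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
      102400 blocks [2/2] [UU]
md1 : active raid1 hda2[0]
      2048000 blocks [2/1] [U_]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of arrays with fewer active devices than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded.append(current)
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real system the same function could be given the contents of
/proc/mdstat, while the doctored sample stands in for a failure during
testing.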

   - Robin