Re: OK, Now this is really weird

On Sat, 26 Feb 2011 11:35:11 +0000 Mathias Burén <mathias.buren@xxxxxxxxx>
wrote:

> On 26 February 2011 11:20, Leslie Rhorer <lrhorer@xxxxxxxxxxx> wrote:
> >
> >
> >> -----Original Message-----
> >> From: linux-raid-owner@xxxxxxxxxxxxxxx [mailto:linux-raid-
> >> owner@xxxxxxxxxxxxxxx] On Behalf Of Jeff Woods
> >> Sent: Saturday, February 26, 2011 1:36 AM
> >> To: lrhorer@xxxxxxxxxxx
> >> Cc: 'Linux RAID'
> >> Subject: Re: OK, Now this is really weird
> >>
> >> Quoting Leslie Rhorer <lrhorer@xxxxxxxxxxx>:
> >> >     I have a pair of drives each of whose 3 partitions are members of a
> >> > set of 3 RAID arrays.  One of the two drives had a flaky power
> >> connection
> >> > which I thought I had fixed, but I guess not, because the drive was
> >> taken
> >> > offline again on Tuesday.  The significant issue, however, is that both
> >> > times the drive failed, mdadm behaved really oddly.  The first time I
> >> > thought it might just be some odd anomaly, but the second time it did
> >> > precisely the same thing.  Both times, when the drive was de-registered
> >> by
> >> > udev, the first two arrays properly responded to the failure, but the
> >> third
> >> > array did not.  Here is the layout:
> >>
> >> [snip lots of technical details]
> >>
> >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
> >> > been failed and removed on /dev /md3 like it has on /dev/md1 and
> >> /dev/md2?
> >>
> >> Is it possible there has been no I/O request for /dev/md3 since
> >> /dev/sdk failed?
> >
> >        Well, I thought about that.  It's swap space, so I suppose it's
> > possible.  I would have thought, however, that mdadm would fail a missing
> > member whether there is any I/O or not.
> >
> 
> I thought so as well.  But how will mdadm know that the device is faulty
> unless the device is generating errors?  (Which usually only happens on a
> read and/or write.)

With a sufficiently recent mdadm, the command

   mdadm -If sdXX

will find any md array that has /dev/sdXX as a member and will fail and
remove it.
Note that the device name is 'sdXX', not '/dev/something'.  This is because,
by the time you want to do this, udev has probably removed all trace of the
device from /dev, so you need to use the kernel name as it appears
in /proc/mdstat or in /sys/block/mdXX/md/dev-$DEVNAME
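
For example (just a sketch, using the sdk3/md3 names from your layout), you
can check which member name md still records and then fail it by that name:

   # /proc/mdstat still lists the vanished member by its kernel name
   grep -A1 '^md3' /proc/mdstat
   # the same name shows up under sysfs
   ls /sys/block/md3/md/ | grep '^dev-'
   # fail and remove it from every array it belongs to
   mdadm -If sdk3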

You can set up a udev rule to run mdadm like this automatically when a device
is hot-unplugged, something like:

 SUBSYSTEM=="block", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
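
(Just as an illustration, not a tested recipe: the rules-file name below is
arbitrary.  You would drop the rule into a file under /etc/udev/rules.d/ and
ask udev to reload its rules:)

   echo 'SUBSYSTEM=="block", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"' \
       > /etc/udev/rules.d/65-md-auto-remove.rules
   udevadm control --reload-rules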

NeilBrown