Re: RAID showing all devices as spares after partial unplug

I should add that the mdadm command in question actually ends in
/dev/md0, not /dev/md3 (that's for another array). So the device name
for the array I'm seeing in mdstat DOES match the one in the assemble
command.

On Sat, Sep 17, 2011 at 4:39 PM, Mike Hartman <mike@xxxxxxxxxxxxxxxxxxxx> wrote:
> I have 11 drives in a RAID 6 array. Six are plugged into one eSATA
> enclosure, the other four are in another. These eSATA cables are prone
> to loosening when I'm working on nearby hardware.
>
> If that happens and I start the host up, big chunks of the array are
> missing and things could get ugly. Thus I cooked up a custom startup
> script that verifies each device is present before starting the array
> with
>
> mdadm --assemble --no-degraded -u 4fd7659f:12044eff:ba25240d:de22249d /dev/md3
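>
> (For illustration, the check in that script is little more than a loop
> over the expected member devices before the assemble runs - the device
> names below are placeholders, not my exact list:)
>
> #!/bin/sh
> # Members of the array; illustrative names only.
> MEMBERS="/dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdh1 /dev/sdj1
>          /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1"
> for dev in $MEMBERS; do
>     if [ ! -b "$dev" ]; then
>         echo "missing member: $dev - not assembling" >&2
>         exit 1
>     fi
> done
> # Everything present; still refuse to start degraded, just in case.
> mdadm --assemble --no-degraded -u 4fd7659f:12044eff:ba25240d:de22249d /dev/md3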
>
> So I thought I was covered: if something got unplugged, I would see
> the array fail to start at boot, and I could shut down, fix the
> cables and try again. However, I hit a new scenario today when one of
> the plugs came loose while everything was powered on.
>
> The good news is that there should have been no activity on the array
> when this happened, particularly write activity. It's a big media
> partition and sees much less writing than reading. I'm also the only
> one who uses it, and I know I wasn't transferring anything. The system
> also seems to have immediately remounted the filesystem read-only,
> because I discovered the issue when I went to write to it later and
> got a "read-only filesystem" error. So I believe the state of the
> drives should be the same - nothing should be out of sync.
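>
> (For what it's worth, the remount is easy to confirm - an "ro" in the
> options field of /proc/mounts for the array's mount point:)
>
> grep md0 /proc/mounts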
>
> However, I shut the system down, fixed the cables and brought it back
> up. All the devices are detected by my script and it tries to start
> the array with the command I posted above, but I've ended up with
> this:
>
> md0 : inactive sdn1[1](S) sdj1[9](S) sdm1[10](S) sdl1[11](S)
> sdk1[12](S) md3p1[8](S) sdc1[6](S) sdd1[5](S) md1p1[4](S) sdf1[3](S)
> sdh1[0](S)
>       16113893731 blocks super 1.2
>
> Instead of all coming back up, or still showing the unplugged drives
> missing, everything is a spare? I'm suitably disturbed.
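>
> (If it helps diagnose things, I can dump what each member's superblock
> says about event counts and device roles with mdadm's examine mode -
> the glob below just matches the member names from the mdstat output:)
>
> mdadm --examine /dev/sd[cdfhjklmn]1 /dev/md1p1 /dev/md3p1 | \
>     egrep 'Events|Device Role|Array State'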
>
> It seems to me that if the data on the drives still reflects the
> last-good data from the array (and since no writing was going on it
> should) then this is just a matter of some metadata getting messed up
> and it should be fixable. Can someone please walk me through the
> commands to do that?
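>
> (From the mdadm man page I gather the usual recovery for this is to
> stop the inactive array and force-assemble it, roughly:
>
> mdadm --stop /dev/md0
> mdadm --assemble --force -u 4fd7659f:12044eff:ba25240d:de22249d /dev/md0
>
> but since --force rewrites metadata, I'd rather have confirmation from
> someone who knows before I run it.)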
>
> Mike
>

