RAID5 in strange state

Dear List,

I have been using MD RAID 5 for some years and so far have had to
recover from single-disk failures a few times, always successfully.
Now, though, I am puzzled.

Setup:
A PC with 3x WD 1 TB SATA disk drives set up as RAID 5, currently
running kernel 2.6.27.21; the array has run fine for at least 6 months.

I check the state of the RAID every few days by looking at
/proc/mdstat manually.
Apparently one drive was kicked out of the array 4 days ago without me
noticing it.
The root cause seems to be bad cabling, but that is not confirmed yet.
Anyway, the disk in question ("sde") reports 23 UDMA_CRC errors,
compared to 0 about 2 weeks ago.
Reading the complete device just now via dd still shows those 23
errors but no new ones.

Well, RAID 5 should survive a single disk failure (again), but after a
reboot (due to non-RAID-related reasons) the RAID came up as "md0
stopped":

cat /proc/mdstat

Personalities :
md0 : inactive sdc1[1](S) sdd1[2](S) sde1[0](S)
      2930279424 blocks

unused devices: <none>



What does that mean?
First, documentation on the web is rather outdated and/or incomplete.
Second, my guess that "(S)" marks a spare is backed up by the kernel
source.
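
For anyone who wants to double-check: the markers are printed by the
/proc/mdstat code in drivers/md/md.c, and a plain grep finds the spot
(path as in a 2.6.27 source tree):

  grep -n '"(S)"' drivers/md/md.c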


mdadm --examine [devices] reports a consistent RAID 5 structure on all
three members:

          Magic : a92b4efc
        Version : 0.90.00
           UUID : ec4fdb7b:e57733c0:4dc42c07:36d99219
  Creation Time : Wed Dec 24 11:40:29 2008
     Raid Level : raid5
  Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
     Array Size : 1953519616 (1863.02 GiB 2000.40 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
...
         Layout : left-symmetric
     Chunk Size : 256K



The state, though, differs:

sdc1:
    Update Time : Tue Apr  7 20:51:33 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ccff6a15 - correct
         Events : 177920
...
      Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       49        2      active sync   /dev/sdd1



sdd1:
    Update Time : Tue Apr  7 20:51:33 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ccff6a27 - correct
         Events : 177920

         Layout : left-symmetric
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     2       8       49        2      active sync   /dev/sdd1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       49        2      active sync   /dev/sdd1



sde1:
    Update Time : Fri Apr  3 15:00:31 2009
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ccf463ec - correct
         Events : 7

         Layout : left-symmetric
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     0       8       65        0      active sync   /dev/sde1

   0     0       8       65        0      active sync   /dev/sde1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       49        2      active sync   /dev/sdd1



sde is the device that failed and was kicked out of the array.
The update time reflects that, if I interpret it right.
But how can sde1's superblock claim 3 active and working devices? IMO
that is way off.
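
For reference, the differing fields can be compared side by side with
something like this (the egrep pattern is just my attempt to pick out
the interesting lines):

  mdadm --examine /dev/sd[cde]1 | egrep '^/dev|Update Time|Events|State :'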


Now, my assumption:
I think I should be able to remove sde temporarily and just restart the
degraded array from sdc1/sdd1 alone; see the sketch below.
Correct?
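
In concrete terms I imagine something along these lines (untested on my
side; I am not sure whether --run is actually required to start the
array with only two of the three members):

  # stop the half-assembled, inactive array
  mdadm --stop /dev/md0

  # assemble it degraded from the two up-to-date members
  mdadm --assemble /dev/md0 /dev/sdc1 /dev/sdd1 --run

  # later, once the cabling is sorted out, re-add sde1 and let it resync
  mdadm /dev/md0 --add /dev/sde1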

My backup is a few days old and I would really like to keep the work
done on the RAID in the meantime.

If the answer is just 2 or 3 mdadm command lines, I am yours :-)

Best regards

Frank Baumgart

