Ok, the bad MPT board is out, replaced by a SI3132, and I rejiggered the drives around so that all the drives are connected. It brought me back to the main problem. md2 is running fine, md1 cannot assemble with only 5 drives out of the 7. Here is the data you requested:

(none):~ # cat /etc/mdadm.conf
DEVICE partitions
ARRAY /dev/md0 level=raid0 UUID=9412e7e1:fd56806c:0f9cc200:95c7ed98
ARRAY /dev/md3 level=raid0 UUID=67999c69:4a9ca9f9:7d4d6b81:91c98b1f
ARRAY /dev/md1 level=raid5 UUID=b737af5c:7c0a70a9:99a648a0:7f693c7d
ARRAY /dev/md2 level=raid5 UUID=e70e0697:a10a5b75:941dd76f:196d9e4e
#ARRAY /dev/md2 level=raid0 UUID=658369ee:23081b79:c990e3a2:15f38c70
#ARRAY /dev/md3 level=raid0 UUID=e2c910ae:0052c38e:a5e19298:0d057e34
MAILADDR root

(md0 and md3 are old arrays that have since been removed - no disks with their uuids are in the system)

(none):~> mdadm -D /dev/md1
mdadm: md device /dev/md1 does not appear to be active.

(none):~> mdadm -D /dev/md2
/dev/md2:
        Version : 00.90.03
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 7
  Total Devices : 7
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Thu Jan 1 21:59:20 2009
          State : clean
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
         Events : 0.1438838

    Number   Major   Minor   RaidDevice State
       0       8      209        0      active sync   /dev/sdn1
       1       8      129        1      active sync   /dev/sdi1
       2       8      177        2      active sync   /dev/sdl1
       3       8       17        3      active sync   /dev/sdb1
       4       8       33        4      active sync   /dev/sdc1
       5       8       65        5      active sync   /dev/sde1
       6       8      193        6      active sync   /dev/sdm1

(md1 is comprised of sdd1 sdf1 sdg1 sdh1 sdj1 sdk1 sdo1)

(none):~> mdadm --examine /dev/sdd1 /dev/sdf1 /dev/sdg1 /dev/sdh1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : 8ea6369b:cfd1c103:845a1a65:d8b1f254
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : ce94ad09 - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 7 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__uuUu 4 failed

/dev/sdf1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : 50c2e80e:e36efc92:5ddac3b0:4d847236
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : feaab82b - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 5 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__uUuu 4 failed

/dev/sdg1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691
Internal Bitmap : -234 sectors from superblock
    Update Time : Fri Jan 2 17:30:13 2009
       Checksum : 28b13f46 - correct
         Events : 2295116
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 0 (0, 1, failed, failed, 3, 4, failed, 5, 6)
    Array State : Uu_uuuu 3 failed

/dev/sdh1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : 28abe59d - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 0 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : U__uuuu 4 failed

(none):~> mdadm --examine /dev/sdj1 /dev/sdk1 /dev/sdo1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c61e1d1a:b123f01a:4098ab5e:e8932eb6
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : bf7696f0 - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 8 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__uuuU 4 failed

/dev/sdk1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : f1417b9d:64d9c93d:c32d16e8:470ab7af
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : e8a17bad - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 4 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__Uuuu 4 failed

/dev/sdo1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691
Internal Bitmap : -234 sectors from superblock
    Update Time : Fri Jan 2 17:17:40 2009
       Checksum : 28b13bcd - correct
         Events : 2294980
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 0 (0, 1, failed, failed, 3, 4, failed, 5, 6)
    Array State : Uu_uuuu 3 failed

(none):~> cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdn1[0] sdm1[6] sde1[5] sdc1[4] sdb1[3] sdl1[2] sdi1[1]
      5860559616 blocks level 5, 128k chunk, algorithm 2 [7/7] [UUUUUUU]

md1 : inactive sdh1[0](S) sdj1[8](S) sdd1[7](S) sdf1[5](S) sdk1[4](S)
      4883799040 blocks super 1.0

unused devices: <none>

I'm not seeing any errors on boot - all the drives come up now. It's just that md can't put md1 back together again. Once that happens, then I can try with lvm and see if I can't get the filesystem online.

Anything else that would be helpful? I am happy to attach the whole bootup log, but it's a little long...
thanks VERY much!

Mike


----- Original Message ----
From: Justin Piszcz <jpiszcz@xxxxxxxxxxxxxxx>
To: Mike Myers <mikesm559@xxxxxxxxx>
Cc: linux-raid@xxxxxxxxxxxxxxx; john lists <john4lists@xxxxxxxxx>
Sent: Thursday, January 1, 2009 10:29:15 AM
Subject: Re: Need urgent help in fixing raid5 array

I think some output would be pertinent here:

mdadm -D /dev/md0..1..2 etc
cat /proc/mdstat
dmesg/syslog of the errors you are seeing
etc

On Thu, 1 Jan 2009, Mike Myers wrote:

> The disks that are problematic are still online as far as the OS can tell. I can do a dd from them and pull off data at the normal speeds, so I don't understand, if that's the case, why the backplane would be a problem here. I can try and move them to another slot however (I have a 20 slot SATA backplane in there) and see if that changes how md deals with it.
>
> The OS sees the drive, it inits fine, but md shows it as removed and won't let me add it back to the array because of the "device being busy". I don't understand the criteria that md uses to add a drive, I guess. The uuid looks fine, and if the event count is off, then the -f flag should take care of that. I've never seen a "device busy" failure on an add before.
>
> thx
> mike
>
> ----- Original Message ----
> From: Justin Piszcz <jpiszcz@xxxxxxxxxxxxxxx>
> To: Mike Myers <mikesm559@xxxxxxxxx>
> Cc: linux-raid@xxxxxxxxxxxxxxx; john lists <john4lists@xxxxxxxxx>
> Sent: Thursday, January 1, 2009 7:40:21 AM
> Subject: Re: Need urgent help in fixing raid5 array
>
> On Thu, 1 Jan 2009, Mike Myers wrote:
>
>> Well, thanks for all your help last month. As I posted, things came back up and I survived the failure. Now I have yet another problem. :( After 5 years of running a linux server as a dedicated NAS, I am hitting some very weird problems. This server started as a single-processor AMD system with 4 320GB drives and has been upgraded multiple times, so that it is now a quad core Intel rackmounted 4U system with 14 1TB drives, and I have never lost data in any of the upgrades of CPU, motherboard, disk controller hardware and disk drives. Now, after last month's near-death experience, I am faced with another serious problem in less than a month. Any help you guys could give me would be most appreciated. This is a sucky way to start the new year.
>>
>> The array I had problems with last month (md2, comprised of 7 1TB drives in a RAID5 config) is running just fine. md1, which is built of 7 1TB Hitachi 7K1000 drives, is now having problems. We returned from a 10 day family visit with everything running just fine. There was a brief power outage today, abt 3 mins, but I can't see how that could be related, as the server is on a high quality rackmount 3U APC UPS that handled the outage just fine. I was working on the system getting X to work again after an nvidia driver update, and when that was working fine, checked the disks to discover that md1 was in a degraded state, with /dev/sdl1 kicked out of the array (removed). I tried to do a dd from the drive to verify its location in the rack, but I got an i/o error. This was most odd, so I went to the rack and pulled the disk and reinserted it. No system log entries recorded the device being pulled or re-installed. So I am thinking that a cable somehow has come loose. I power the system down, pull it out of the rack, look at the cable that goes to the drive, and everything looks fine.
>>
>> So I reboot the system, and now the array won't come online because, in addition to the drive that shows as (removed), one of the other drives shows as a faulty spare. Well, learning from the last go-around, I reassemble the array with the --force option, and the array comes back up. But LVM won't come back up because it sees the physical volume that maps to md1 as missing. Now I am very concerned. After trying a bunch of things, I do a pvcreate with the missing UUID on md1, restart the vg, and the logical volume comes back up. I was thinking I may have told lvm to use an array of bad data, but to my surprise, I mounted the filesystem and everything looked intact! Ok, sometimes you win. So I do one more reboot to get the system back up in multiuser so I can back up some of the more important media stored on the volume (it's got about 10 TB used, but most of that is PVR recordings; there is a lot of ripped music and DVDs that I really don't want to re-rip) on another server that has some space on it, while I figure out what has been happening.
>>
>> The reboot again fails because of a problem with md1. This time another one of the drives shows as removed (/dev/sdm1), and I can't reassemble the array with the --force option. It is acting like /dev/sdl1 (the other removed unit): even though I can read from the drives fine, their UUIDs are fine, etc., md does not consider them part of the array. /dev/sdo1 (which was the drive that looked like a faulty spare) seems OK when trying to do the assemble. sdm1 seemed just fine before the reboot and was showing no problems. They are not hooked up on the same controller cable (a SAS-to-SATA fanout), and the LSI MPT controller card seems to talk to the other disks just fine.
>>
>> Anyways, I have no idea as to what's going on. When I try to add sdm1 or sdl1 back into the array, md complains the device is busy, which is very odd because it's not part of another array or doing anything else in the system.
>>
>> Any idea as to what could be happening here? I am beyond frustrated.
>>
>> thanks,
>> Mike
>
> If you are using a hotswap chassis, then it has some sort of SATA backplane. I have seen backplanes go bad in the past; that would be my first replacement.
>
> Justin.