Re: some ?? re failed disk and resyncing of array

On Sat, 31 Jan 2009 10:38:22 +0000, "David Greaves" <david@xxxxxxxxxxxx>
said:
> whollygoat@xxxxxxxxxxxxxxx wrote:
> > On a boot a couple of days ago, mdadm failed a disk and
> > started resyncing to spare (raid5, 6 drives, 5 active, 1
> > spare).  smartctl -H <disk> returned info (can't remember
> > the exact text) that made me suspect the drive was
> > fine, but the data connection was bad.  Sure enough the
> > data cable was damaged.  Replaced the cable and smartctl
> > sees the disk just fine and reports no errors.
> > 
> > - I'd like to readd the drive as a spare.  Is it enough
> > to "mdadm --add /dev/hdk" or do I need to prep the drive to
> > remove any data that said where it previously belonged
> > in the array?
> That should work.
> Any issues and you can zero the superblock (man mdadm)
> No need to zero the disk.

Would --re-add be better?
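
To spell out what I'm considering (device names are just from my setup;
the spare here is /dev/hdk, and adjust if the member is actually a
partition like /dev/hdk1):

  # attempt to slot the disk back in using its existing superblock/bitmap
  mdadm /dev/md0 --re-add /dev/hdk

  # or, if md refuses that, wipe the stale metadata and add a fresh spare
  mdadm --zero-superblock /dev/hdk
  mdadm /dev/md0 --add /dev/hdk

The wipe-then-add path is just my reading of your "zero the superblock"
suggestion, so correct me if I've got it backwards.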

I've noticed something else since I made the initial post:

--------- begin output -------------
fly:~# mdadm -D /dev/md0
/dev/md0:
        Version : 01.00.03
  Creation Time : Sun Jan 11 21:49:36 2009
     Raid Level : raid5
     Array Size : 312602368 (298.12 GiB 320.10 GB)
    Device Size : 156301184 (74.53 GiB 80.03 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Jan 30 15:52:01 2009
          State : active
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : fly:FlyFileServ_md  (local to host fly)
           UUID : 0e2b9157:a58edc1d:213a220f:68a555c9
         Events : 16

    Number   Major   Minor   RaidDevice State
       0      33        1        0      active sync   /dev/hde1
       1      34        1        1      active sync   /dev/hdg1
       2      56        1        2      active sync   /dev/hdi1
       5      89        1        3      active sync   /dev/hdo1
       6      88        1        4      active sync   /dev/hdm1


fly:~# mdadm -E /dev/hdo1
/dev/hdo1:
          Magic : a92b4efc
        Version : 01
    Feature Map : 0x1
     Array UUID : 0e2b9157:a58edc1d:213a220f:68a555c9
           Name : fly:FlyFileServ_md  (local to host fly)
  Creation Time : Sun Jan 11 21:49:36 2009
     Raid Level : raid5
   Raid Devices : 5

    Device Size : 234436336 (111.79 GiB 120.03 GB)
     Array Size : 625204736 (298.12 GiB 320.10 GB)
      Used Size : 156301184 (74.53 GiB 80.03 GB)
   Super Offset : 234436464 sectors
          State : clean
    Device UUID : e072bd09:2df53d6d:d23321cc:cf2c37de

Internal Bitmap : 2 sectors from superblock
    Update Time : Fri Jan 30 15:52:01 2009
       Checksum : 4689ff5 - correct
         Events : 16

         Layout : left-symmetric
     Chunk Size : 64K

    Array Slot : 5 (0, 1, 2, failed, failed, 3, 4)
   Array State : uuuUu 2 failed
--------- end output -------------

Why does the "Array Slot" field show 7 slots?  And why does the "Array
State" field show 2 failed?  There were only ever 6 disks in the array,
and only one of those is currently missing.  The mdadm -D output above
doesn't list any failed devices in its "Failed Devices" field.

Thanks for your answers below as well; they're roughly what I was
expecting.  There was a h/w problem that took ages to track down, and I
think it was responsible for all the e2fs errors.

WG

> 
> > - When I tried to list some files on one of the filesystems
> > on the array (the fact that it took so long to react to
> > the ls is how I discovered the box was in the middle of
> > rebuilding to spare)
> This is OK - resync involves a lot of IO and can slow things down. This
> is tuneable.
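
Good to know.  For my own notes, I'm assuming the knobs are the usual md
speed limits (system-wide, in KB/s per device):

  cat /proc/sys/dev/raid/speed_limit_min
  cat /proc/sys/dev/raid/speed_limit_max

  # e.g. raise the floor to finish a resync faster, or lower the
  # ceiling to keep the box responsive while one is running
  echo 10000 > /proc/sys/dev/raid/speed_limit_max

or the per-array files under /sys/block/md0/md/ (sync_speed_min and
sync_speed_max), if I've read the docs right.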
> 
> > it couldn't find the file (or many 
> > others).  I thought that resyncing was supposed to be
> > transparent, yet parts of the fs seemed to be missing.
> > Everything was there afterwards.  Is that normal?
> No. This is nothing to do with normal md resyncing and certainly not
> expected.
> 
> > - On a subsequent boot I had to run e2fsck on the three
> > filesystems housed on the array.  Many stray blocks, 
> > illegal inodes, etc. were found.  An artifact of the rebuild
> > or unrelated?
> Well, you had a fault in your IO system, so there's a good chance your
> IO broke.
> 
> Verify against a backup.
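
Will do.  I'm thinking of a dry-run checksum compare against the backup
tree, something like (paths are made up):

  # -r recursive, -n dry run, -c compare by checksum, -i itemize changes
  rsync -rnci /backup/fileserv/ /mnt/array/

which should list any files whose contents differ without changing
anything.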
> 
> David
> 
> 
> -- 
> "Don't worry, you'll be fine; I saw it work in a cartoon once..."
-- 
whollygoat@xxxxxxxxxxxxxxx

-- 
http://www.fastmail.fm - IMAP accessible web-mail

