Re: raid5 reshape/resync

----- Message from neilb@xxxxxxx ---------
    Date: Thu, 29 Nov 2007 16:48:47 +1100
    From: Neil Brown <neilb@xxxxxxx>
Reply-To: Neil Brown <neilb@xxxxxxx>
 Subject: Re: raid5 reshape/resync
      To: Nagilum <nagilum@xxxxxxxxxxx>
      Cc: linux-raid@xxxxxxxxxxxxxxx

> Hi,
> I'm running 2.6.23.8 x86_64 using mdadm v2.6.4.
> I was adding a disk (/dev/sdf) to an existing raid5 (/dev/sd[a-e] -> md0)
> During that reshape (at around 4%) /dev/sdd reported read errors and
> went offline.

Sad.

> I replaced /dev/sdd with a new drive and tried to reassemble the array
> (/dev/sdd was shown as removed and now as spare).

There must be a step missing here.
Just because one drive goes offline, that doesn't mean that you need
to reassemble the array.  It should just continue with the reshape
until that is finished.  Did you shut the machine down, or did it crash,
or what?

> Assembly worked but it would not run unless I used --force.

That suggests an unclean shutdown.  Maybe it did crash?

I started the reshape and went out. When I came back, the controller was beeping (indicating the faulty disk). I tried to log on but could not get in: the machine was responding to pings, but that was about it (neither ssh nor xdm logins worked), so I hard-rebooted. I booted into a rescue root; /etc/mdadm/mdadm.conf did not yet include the new disk, so the RAID was missing one disk and was not started. Since I didn't know exactly what was going on, I --re-added sdf (the new disk) and tried to resume the reshape. A second into that, the read failure on /dev/sdd was reported again. So I stopped md0 and shut down to verify the read error with another controller. After I had verified it, I replaced /dev/sdd with a new drive and put the broken drive back in as /dev/sdg, just in case.
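
For the record, what I did in the rescue root boiled down to roughly the following; this is from memory, so take it as a sketch rather than an exact transcript:

nas:~# mdadm /dev/md0 --re-add /dev/sdf   # put the new disk back into the degraded array
nas:~# mdadm --run /dev/md0               # (with --force at assembly where needed) to get the reshape going again
nas:~# cat /proc/mdstat                   # watch the reshape position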

> Since I'm always reluctant to use force I put the bad disk back in,
> this time as /dev/sdg . I re-added the drive and could run the array.
> The array started to resync (since the disk can be read until 4%) and
> then I marked the disk as failed. Now the array is "active, degraded,
> recovering":

It should have restarted the reshape from wherever it was up to, so
it should have hit the read error almost immediately.  Do you remember
where it started the reshape from?  If it restarted from the beginning,
that would be bad.

It must have continued where it left off since the reshape position in all superblocks was at about 4%.
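
(For the archives, the position can be read straight out of the superblocks, e.g.:

nas:~# mdadm -E /dev/sdg | grep "Reshape pos'n"
  Reshape pos'n : 118360960 (112.88 GiB 121.20 GB)

the full --examine output for that drive is quoted further down.)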

Did you just "--assemble" all the drives or did you do something else?

Sorry for being a bit imprecise here: I didn't actually have to use --assemble. When booting into the rescue root, the RAID came up with /dev/sdd and /dev/sdf removed; I just had to --re-add /dev/sdf.

> unusually low which seems to indicate a lot of seeking as if two
> operations are happening at the same time.

Well, reshape is always slow, as it has to read from one part of the
drive and write to another part of the drive.

Actually it was resyncing at the minimum speed; I managed to crank it up to >20 MB/s by raising /sys/block/md0/md/sync_speed_min.
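
(For the archives: the value is in KiB/s, and the system-wide default lives in /proc/sys/dev/raid/speed_limit_min. What I did was essentially this, exact number from memory:

nas:~# echo 25000 > /sys/block/md0/md/sync_speed_min   # raise the minimum resync speed (KiB/s)
nas:~# cat /proc/mdstat                                # confirm the resync actually speeds up
)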

> Can someone relieve my doubts as to whether md does the right thing here?
> Thanks,

I believe it is doing "the right thing".

>
----- End message from neilb@xxxxxxx -----

Ok, so the reshape tried to continue without the failed drive and
after that resynced to the new spare.

As I would expect.

> Unfortunately the result is a mess. On top of the Raid5 I have
> dm-crypt and LVM.

Hmm.  This I would not expect.

> Although dm-crypt and LVM don't appear to have a problem, the
> filesystems on top are a mess now.

Can you be more specific about what sort of "mess" they are in?

Sure.
So here is the vg-layout:
nas:~# lvdisplay vg01
  --- Logical volume ---
  LV Name                /dev/vg01/lv1
  VG Name                vg01
  LV UUID                4HmzU2-VQpO-vy5R-Wdys-PmwH-AuUg-W02CKS
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                512.00 MB
  Current LE             128
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:1

  --- Logical volume ---
  LV Name                /dev/vg01/lv2
  VG Name                vg01
  LV UUID                4e2ZB9-29Rb-dy4M-EzEY-cEIG-Nm1I-CPI0kk
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                7.81 GB
  Current LE             2000
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:2

  --- Logical volume ---
  LV Name                /dev/vg01/lv3
  VG Name                vg01
  LV UUID                YQRd0X-5hF8-2dd3-GG4v-wQLH-WGH0-ntGgug
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                1.81 TB
  Current LE             474735
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:3

The layout was created like that and, except for increasing the size of lv3, I never changed anything. Therefore I think it's safe to assume the LVs are located in order and without gaps. The first LV is swap, so not much to lose there; the second LV is "/" (reiserfs) and is fine too. The third LV, however, looks pretty bad. I uploaded the "xfs_repair -n /dev/mapper/vg01-lv3" output to http://www.nagilum.org/md/xfs_repair.txt.
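
(To double-check that assumption, the extent mapping can be listed directly; something along these lines should show a single contiguous segment per LV:

nas:~# lvdisplay -m vg01   # per-LV segment list with the physical extent ranges
nas:~# pvdisplay -m        # the same mapping seen from the PV side
)
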
I can mount the filesystem, but the directories all look like this:

drwxr-xr-x 16 nagilum nagilum  155 2007-09-18 18:20 .
drwxr-xr-x  5 nagilum nagilum   89 2007-09-22 17:56 ..
drwxr-xr-x 12 nagilum nagilum  121 2007-09-18 18:19 biz
?---------  ? ?       ?          ?                ? comm
?---------  ? ?       ?          ?                ? dev
drwxr-xr-x  8 nagilum nagilum   76 2007-09-18 18:19 disk
drwxr-xr-x  7 nagilum nagilum   64 2007-09-18 18:19 docs
?---------  ? ?       ?          ?                ? game
?---------  ? ?       ?          ?                ? gfx
drwxr-xr-x  5 nagilum nagilum   40 2007-09-18 18:20 hard
drwxr-xr-x  8 nagilum nagilum   69 2007-09-18 18:20 misc
drwxr-xr-x  4 nagilum nagilum   27 2007-09-18 18:20 mods
drwxr-xr-x  5 nagilum nagilum   39 2007-09-18 18:20 mus
?---------  ? ?       ?          ?                ? pix
drwxr-xr-x  6 nagilum nagilum   51 2007-09-18 18:20 text
drwxr-xr-x 22 nagilum nagilum 4096 2007-09-18 18:21 util

Also, the files that are readable are corrupt.
It looks to me as if md mixed up the chunk order in the stripes past the 4% mark. I looked at a larger text file to see what kind of damage was done: it starts out fine, but at 0xd000 the data becomes random data until 0x11000.
Maybe a table to simplify things:
Ok     0x0     - 0xd000
Random 0xd000  - 0x11000
Ok     0x11000 - 0x21000
Random 0x21000 - 0x25000
Ok     0x25000 - 0x35000
Random 0x35000 - 0x39000

And so on. 0x4000 (16 KiB) is exactly my chunk size, and the corrupt regions repeat every 0x14000 (80 KiB, i.e. every five chunks).
Since LUKS uses the sector number for whitening, the "random data" must be wrongly decrypted text. I'm not sure how to reorder things so it will be ok again; I'll ponder that while I try to recreate the situation using files and losetup (rough sketch below).
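
The test setup I have in mind looks roughly like this (file-backed loop devices; sizes and device numbers are arbitrary, just enough to reproduce a 5->6 grow with 16k chunks):

nas:~# for i in 0 1 2 3 4 5; do dd if=/dev/zero of=/tmp/d$i bs=1M count=128; losetup /dev/loop$i /tmp/d$i; done
nas:~# mdadm --create /dev/md1 --level=5 --chunk=16 --raid-devices=5 /dev/loop[0-4]
nas:~# # ... write a known pattern onto /dev/md1, then grow it:
nas:~# mdadm --add /dev/md1 /dev/loop5
nas:~# mdadm --grow /dev/md1 --raid-devices=6
nas:~# mdadm /dev/md1 --fail /dev/loop2    # fail one member while the reshape is still running

If the same one-bad-chunk-in-five pattern shows up there, it will at least be reproducible and easier to reason about.
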
And finally the information from the failed drive:

nas:~# mdadm -E /dev/sdg
/dev/sdg:
          Magic : a92b4efc
        Version : 00.91.00
           UUID : 25da80a6:d56eb9d6:0d7656f3:2f233380
  Creation Time : Sat Sep 15 21:11:41 2007
     Raid Level : raid5
  Used Dev Size : 488308672 (465.69 GiB 500.03 GB)
     Array Size : 2441543360 (2328.44 GiB 2500.14 GB)
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0

  Reshape pos'n : 118360960 (112.88 GiB 121.20 GB)
  Delta Devices : 1 (5->6)

    Update Time : Fri Nov 23 20:05:50 2007
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 9a8358c4 - correct
         Events : 0.677965

         Layout : left-symmetric
     Chunk Size : 16K

      Number   Major   Minor   RaidDevice State
this     3       8       96        3      active sync   /dev/sdg

   0     0       8        0        0      active sync   /dev/sda
   1     1       8       16        1      active sync   /dev/sdb
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       96        3      active sync   /dev/sdg
   4     4       8       64        4      active sync   /dev/sde
   5     5       8       80        5      active sync   /dev/sdf
   6     6       8       48        6      spare   /dev/sdd
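
(For completeness, the quickest way to compare the superblocks of all members side by side is probably something like:

nas:~# mdadm -E /dev/sd[a-g] | grep -E "Events|Reshape"
)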

From md's point of view, the array is of course "fine" now:

nas:~# mdadm -Q --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Sep 15 21:11:41 2007
     Raid Level : raid5
     Array Size : 2441543360 (2328.44 GiB 2500.14 GB)
  Used Dev Size : 488308672 (465.69 GiB 500.03 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Dec  1 15:25:59 2007
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           UUID : 25da80a6:d56eb9d6:0d7656f3:2f233380
         Events : 0.986918

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf

Ok, enough for now; any useful ideas are greatly appreciated!
Alex.



========================================================================
#    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__ ____ _(_) /_ ____ _  nagilum@xxxxxxxxxxx \n +491776461165 #
#  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
========================================================================


----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..
