Array died during grow; now resync stopped


Hi all,

My server crashed in the middle of an array grow.
The command line was "mdadm --grow /dev/md0 --raid-devices=6 --chunk=1M".

Now the reshape is stuck at 27% and won't continue.
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sde1[0] sdg1[9] sdc1[6] sdb1[7] sdd1[8] sdf1[5]
      5860548608 blocks super 1.0 level 5, 256k chunk, algorithm 2 [6/6] [UUUUUU]
      [=====>...............]  reshape = 27.9% (410229760/1465137152) finish=8670020128.0min speed=0K/sec
      
unused devices: <none>
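For what it's worth, the absurd finish estimate seems to be just the remaining work divided by the current (zero) speed. A small sketch of that arithmetic, using the numbers from the mdstat output above (my reading of the output format, not the kernel's actual code):

```python
# Numbers from /proc/mdstat: reshape = 27.9% (410229760/1465137152), speed=0K/sec
done_kib = 410229760        # per-device KiB already reshaped
total_kib = 1465137152      # total per-device KiB to reshape
speed_kib_s = 0             # current speed reported by the kernel

remaining_kib = total_kib - done_kib
print(f"remaining: {remaining_kib} KiB")

# finish = remaining / speed; with the speed pinned at 0 the kernel's
# estimate degenerates into a huge, meaningless number of minutes
if speed_kib_s == 0:
    print("finish estimate is meaningless: speed is 0 K/sec, the reshape is stalled")
else:
    print(f"finish in about {remaining_kib / speed_kib_s / 60:.1f} min")
```

So roughly 1 TiB of per-device reshaping remains, and nothing is moving.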


$ mdadm -D /dev/md0
/dev/md0:
        Version : 1.0
  Creation Time : Thu Oct  7 09:28:04 2010
     Raid Level : raid5
     Array Size : 5860548608 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 1465137152 (1397.26 GiB 1500.30 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sun Feb  1 13:30:05 2015
          State : clean, reshaping 
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

 Reshape Status : 27% complete
  Delta Devices : 1, (5->6)
  New Chunksize : 1024K

           Name : stelli:3  (local to host stelli)
           UUID : 52857d77:3806e446:477d4865:d711451e
         Events : 2254869

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       5       8       81        1      active sync   /dev/sdf1
       8       8       49        2      active sync   /dev/sdd1
       7       8       17        3      active sync   /dev/sdb1
       6       8       33        4      active sync   /dev/sdc1
       9       8       97        5      active sync   /dev/sdg1


smartctl reports all disks as healthy: no reallocated sectors, no pending sectors, etc.

The system load stays at 2.0:
$ cat /proc/loadavg 
2.00 2.00 1.95 1/140 2937
which seems to come from udevd and md0_reshape, both stuck in uninterruptible sleep (D state):
$ ps fax
  PID TTY      STAT   TIME COMMAND
    2 ?        S      0:00 [kthreadd]
...
 1671 ?        D      0:00  \_ [md0_reshape]
...
 1289 ?        Ss     0:01 /sbin/udevd --daemon
 1672 ?        D      0:00  \_ /sbin/udevd --daemon


Could this be caused by a software deadlock?

The system has 2 GB RAM and 2 GB swap. Is that enough for the reshape to complete?
$ free
             total       used       free     shared    buffers     cached
Mem:       1799124     351808    1447316        540      14620     286216
-/+ buffers/cache:      50972    1748152
Swap:      2104508          0    2104508
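As far as I understand, the "-/+ buffers/cache" line is plain arithmetic on the first line (my reconstruction of what free prints, not its source), and it shows almost all RAM is effectively available, so the box doesn't look memory-starved:

```python
# Raw numbers from the `free` output above (KiB)
total, used, free_kib = 1799124, 351808, 1447316
buffers, cached = 14620, 286216

# "-/+ buffers/cache": memory actually held by applications,
# and memory effectively available to them once caches are dropped
used_real = used - buffers - cached
free_real = free_kib + buffers + cached
print(used_real, free_real)   # prints 50972 1748152, matching the line above
```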


And in dmesg I found this:
$ dmesg | less
[    5.456941] md: bind<sdg1>
[   11.015014] xor: measuring software checksum speed
[   11.051384]    prefetch64-sse:  3291.000 MB/sec
[   11.091375]    generic_sse:  3129.000 MB/sec
[   11.091378] xor: using function: prefetch64-sse (3291.000 MB/sec)
[   11.159365] raid6: sse2x1    1246 MB/s
[   11.227343] raid6: sse2x2    2044 MB/s
[   11.295327] raid6: sse2x4    2487 MB/s
[   11.295331] raid6: using algorithm sse2x4 (2487 MB/s)
[   11.295334] raid6: using intx1 recovery algorithm
[   11.328771] md: raid6 personality registered for level 6
[   11.328776] md: raid5 personality registered for level 5
[   11.328779] md: raid4 personality registered for level 4
[   19.840890] bio: create slab <bio-1> at 1
[  159.701406] md: md0 stopped.
[  159.701413] md: unbind<sdg1>
[  159.709902] md: export_rdev(sdg1)
[  159.709980] md: unbind<sdd1>
[  159.721856] md: export_rdev(sdd1)
[  159.721955] md: unbind<sdb1>
[  159.733883] md: export_rdev(sdb1)
[  159.733991] md: unbind<sdc1>
[  159.749856] md: export_rdev(sdc1)
[  159.749954] md: unbind<sdf1>
[  159.769885] md: export_rdev(sdf1)
[  159.769985] md: unbind<sde1>
[  159.781873] md: export_rdev(sde1)
[  160.471460] md: md0 stopped.
[  160.490329] md: bind<sdf1>
[  160.490478] md: bind<sdd1>
[  160.490689] md: bind<sdb1>
[  160.490911] md: bind<sdc1>
[  160.491164] md: bind<sdg1>
[  160.491408] md: bind<sde1>
[  160.492616] md/raid:md0: reshape will continue
[  160.492638] md/raid:md0: device sde1 operational as raid disk 0
[  160.492640] md/raid:md0: device sdg1 operational as raid disk 5
[  160.492641] md/raid:md0: device sdc1 operational as raid disk 4
[  160.492642] md/raid:md0: device sdb1 operational as raid disk 3
[  160.492644] md/raid:md0: device sdd1 operational as raid disk 2
[  160.492645] md/raid:md0: device sdf1 operational as raid disk 1
[  160.493187] md/raid:md0: allocated 0kB
[  160.493253] md/raid:md0: raid level 5 active with 6 out of 6 devices, algorithm 2
[  160.493256] RAID conf printout:
[  160.493257]  --- level:5 rd:6 wd:6
[  160.493259]  disk 0, o:1, dev:sde1
[  160.493261]  disk 1, o:1, dev:sdf1
[  160.493262]  disk 2, o:1, dev:sdd1
[  160.493263]  disk 3, o:1, dev:sdb1
[  160.493264]  disk 4, o:1, dev:sdc1
[  160.493266]  disk 5, o:1, dev:sdg1
[  160.493336] md0: detected capacity change from 0 to 6001201774592
[  160.493340] md: reshape of RAID array md0
[  160.493342] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  160.493343] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
[  160.493351] md: using 128k window, over a total of 1465137152k.
[  160.951404]  md0: unknown partition table
[  190.984871] udevd[1289]: worker [1672] /devices/virtual/block/md0 timeout; kill it
[  190.984901] udevd[1289]: seq 2259 '/devices/virtual/block/md0' killed



$ mdadm --version
mdadm - v3.3.1 - 5th June 2014

$ uname -a
Linux XXXXX 3.14.14-gentoo #3 SMP Sat Jan 31 18:45:04 CET 2015 x86_64 AMD Athlon(tm) II X2 240e Processor AuthenticAMD GNU/Linux


At the moment I can neither access the array to read the remaining data nor continue the grow.
Can you help me get it running again?


best regards 
Jörg


