On Mon, 2 Feb 2015 09:41:02 +0000 (UTC) Jörg Habenicht
<j.habenicht@xxxxxx> wrote:

> Hi all,
>
> I had a server crash during an array grow.
> The command line was "mdadm --grow /dev/md0 --raid-devices=6 --chunk=1M"
>
> Now the sync is stuck at 27% and won't continue.
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sde1[0] sdg1[9] sdc1[6] sdb1[7] sdd1[8] sdf1[5]
>       5860548608 blocks super 1.0 level 5, 256k chunk, algorithm 2 [6/6] [UUUUUU]
>       [=====>...............]  reshape = 27.9% (410229760/1465137152) finish=8670020128.0min speed=0K/sec
>
> unused devices: <none>
>
>
> $ mdadm -D /dev/md0
> /dev/md0:
>         Version : 1.0
>   Creation Time : Thu Oct 7 09:28:04 2010
>      Raid Level : raid5
>      Array Size : 5860548608 (5589.05 GiB 6001.20 GB)
>   Used Dev Size : 1465137152 (1397.26 GiB 1500.30 GB)
>    Raid Devices : 6
>   Total Devices : 6
>     Persistence : Superblock is persistent
>
>     Update Time : Sun Feb 1 13:30:05 2015
>           State : clean, reshaping
>  Active Devices : 6
> Working Devices : 6
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>  Reshape Status : 27% complete
>   Delta Devices : 1, (5->6)
>   New Chunksize : 1024K
>
>            Name : stelli:3  (local to host stelli)
>            UUID : 52857d77:3806e446:477d4865:d711451e
>          Events : 2254869
>
>     Number   Major   Minor   RaidDevice State
>        0       8       65        0      active sync   /dev/sde1
>        5       8       81        1      active sync   /dev/sdf1
>        8       8       49        2      active sync   /dev/sdd1
>        7       8       17        3      active sync   /dev/sdb1
>        6       8       33        4      active sync   /dev/sdc1
>        9       8       97        5      active sync   /dev/sdg1
>
>
> smartctl reports the disks are OK: no remapped sectors, no pending
> writes, etc.
>
> The system load stays at 2.0:
> $ cat /proc/loadavg
> 2.00 2.00 1.95 1/140 2937
> which may be caused by udevd and md0_reshape:
> $ ps fax
>   PID TTY      STAT   TIME COMMAND
>     2 ?        S      0:00 [kthreadd]
> ...
>  1671 ?        D      0:00  \_ [md0_reshape]
> ...
>  1289 ?        Ss     0:01 /sbin/udevd --daemon
>  1672 ?        D      0:00  \_ /sbin/udevd --daemon
>
>
> Could this be caused by a software lock?

Some sort of software problem I suspect.
What does
  cat /proc/1671/stack
  cat /proc/1672/stack
show?

Alternatively,
  echo w > /proc/sysrq-trigger
and see what appears in 'dmesg'.  (Both steps are sketched at the end
of this message.)

>
> The system has 2G RAM and 2G swap. Is this sufficient to complete?

Memory shouldn't be a problem.  However it wouldn't hurt to see what
value is in /sys/block/md0/md/stripe_cache_size and double it (also
sketched at the end of this message).

If all else fails a reboot should be safe and will probably restart
the reshape properly.  md is very careful about surviving reboots.

NeilBrown

> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1799124     351808    1447316        540      14620     286216
> -/+ buffers/cache:      50972    1748152
> Swap:      2104508          0    2104508
>
>
> And in dmesg I found this:
> $ dmesg | less
> [    5.456941] md: bind<sdg1>
> [   11.015014] xor: measuring software checksum speed
> [   11.051384]    prefetch64-sse: 3291.000 MB/sec
> [   11.091375]    generic_sse: 3129.000 MB/sec
> [   11.091378] xor: using function: prefetch64-sse (3291.000 MB/sec)
> [   11.159365] raid6: sse2x1    1246 MB/s
> [   11.227343] raid6: sse2x2    2044 MB/s
> [   11.295327] raid6: sse2x4    2487 MB/s
> [   11.295331] raid6: using algorithm sse2x4 (2487 MB/s)
> [   11.295334] raid6: using intx1 recovery algorithm
> [   11.328771] md: raid6 personality registered for level 6
> [   11.328776] md: raid5 personality registered for level 5
> [   11.328779] md: raid4 personality registered for level 4
> [   19.840890] bio: create slab <bio-1> at 1
> [  159.701406] md: md0 stopped.
> [  159.701413] md: unbind<sdg1>
> [  159.709902] md: export_rdev(sdg1)
> [  159.709980] md: unbind<sdd1>
> [  159.721856] md: export_rdev(sdd1)
> [  159.721955] md: unbind<sdb1>
> [  159.733883] md: export_rdev(sdb1)
> [  159.733991] md: unbind<sdc1>
> [  159.749856] md: export_rdev(sdc1)
> [  159.749954] md: unbind<sdf1>
> [  159.769885] md: export_rdev(sdf1)
> [  159.769985] md: unbind<sde1>
> [  159.781873] md: export_rdev(sde1)
> [  160.471460] md: md0 stopped.
> [  160.490329] md: bind<sdf1>
> [  160.490478] md: bind<sdd1>
> [  160.490689] md: bind<sdb1>
> [  160.490911] md: bind<sdc1>
> [  160.491164] md: bind<sdg1>
> [  160.491408] md: bind<sde1>
> [  160.492616] md/raid:md0: reshape will continue
> [  160.492638] md/raid:md0: device sde1 operational as raid disk 0
> [  160.492640] md/raid:md0: device sdg1 operational as raid disk 5
> [  160.492641] md/raid:md0: device sdc1 operational as raid disk 4
> [  160.492642] md/raid:md0: device sdb1 operational as raid disk 3
> [  160.492644] md/raid:md0: device sdd1 operational as raid disk 2
> [  160.492645] md/raid:md0: device sdf1 operational as raid disk 1
> [  160.493187] md/raid:md0: allocated 0kB
> [  160.493253] md/raid:md0: raid level 5 active with 6 out of 6 devices,
> algorithm 2
> [  160.493256] RAID conf printout:
> [  160.493257]  --- level:5 rd:6 wd:6
> [  160.493259]  disk 0, o:1, dev:sde1
> [  160.493261]  disk 1, o:1, dev:sdf1
> [  160.493262]  disk 2, o:1, dev:sdd1
> [  160.493263]  disk 3, o:1, dev:sdb1
> [  160.493264]  disk 4, o:1, dev:sdc1
> [  160.493266]  disk 5, o:1, dev:sdg1
> [  160.493336] md0: detected capacity change from 0 to 6001201774592
> [  160.493340] md: reshape of RAID array md0
> [  160.493342] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [  160.493343] md: using maximum available idle IO bandwidth (but not
> more than 200000 KB/sec) for reshape.
> [  160.493351] md: using 128k window, over a total of 1465137152k.
> [  160.951404] md0: unknown partition table
> [  190.984871] udevd[1289]: worker [1672] /devices/virtual/block/md0
> timeout; kill it
> [  190.984901] udevd[1289]: seq 2259 '/devices/virtual/block/md0' killed
>
>
>
> $ mdadm --version
> mdadm - v3.3.1 - 5th June 2014
>
> uname -a
> Linux XXXXX 3.14.14-gentoo #3 SMP Sat Jan 31 18:45:04 CET 2015 x86_64
> AMD Athlon(tm) II X2 240e Processor AuthenticAMD GNU/Linux
>
>
> Currently I can't access the array to read the remaining data, nor can
> I continue growing the array.
> Can you help me get it running?
>
>
> best regards
> Jörg
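
The two diagnostic steps suggested above, as a minimal shell sequence.
This is a sketch, not verbatim from the thread: run it as root, and
note that the PIDs 1671 and 1672 come from the ps listing above and
will differ after any reboot.

  # Kernel stacks of the hung reshape thread and the hung udevd worker:
  cat /proc/1671/stack
  cat /proc/1672/stack

  # Alternatively, have the kernel log the stacks of all blocked
  # (D-state) tasks, then read them back from the kernel log:
  echo 1 > /proc/sys/kernel/sysrq    # only needed if sysrq is disabled
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 60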
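And the stripe cache check, likewise as a sketch.  The sysfs file holds
a count of cache entries, and each entry costs roughly one page (4 KiB)
per member device, so doubling it on this 6-disk array amounts to a few
megabytes at most.

  # Read the current value, double it, and watch the effect (as root):
  cur=$(cat /sys/block/md0/md/stripe_cache_size)
  echo $((cur * 2)) > /sys/block/md0/md/stripe_cache_size
  cat /proc/mdstat    # check whether the reshape speed picks up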