Hi again! Thank you so much for this! Really cool. I patched the kernel,
changing the line as you said, and the reshape is now continuing. Holding my
breath that it will finish, but all looks OK so far. I needed to adjust the
patch a little, since it was written against an older kernel and the line
numbers had shifted. This is what I used for Linux 4.0.5:

--- a/drivers/md/raid5.c	2015-06-08 23:05:02.808214213 +0200
+++ b/drivers/md/raid5.c	2015-06-08 23:05:47.601355604 +0200
@@ -3855,7 +3855,7 @@
 	 */
 	if (s.failed > conf->max_degraded) {
 		sh->check_state = 0;
-		sh->reconstruct_state = 0;
+		//sh->reconstruct_state = 0;
 		if (s.to_read+s.to_write+s.written)
 			handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
 		if (s.syncing + s.replacing)

Thank you again. :)

/ Vilhelm
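PS. In case it is useful to anyone else who hits this, here is roughly what
the process looked like on my end. This is from memory, so treat it as a
sketch: the device name, the patch file name and the -j4 job count are just
placeholders for whatever matches your own setup.

# 1) see whether any member device has entries in its bad-block log,
#    since that is apparently what the reshape trips over
mdadm --examine-badblocks /dev/sdb1        # repeat for each member device

# 2) apply the patch in the kernel source tree and rebuild
cd linux-4.0.5
patch -p1 < raid5-reshape-badblocks.patch  # the diff above, saved to a file
zcat /proc/config.gz > .config             # reuse the running config, if your kernel exposes it
make olddefconfig
make -j4
sudo make modules_install
# install the new kernel image/initramfs the way your distro expects, then reboot

# 3) watch the reshape continue
cat /proc/mdstat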
On Mon, Jun 8, 2015 at 9:31 AM, David Wahler <dwahler@xxxxxxxxx> wrote:
> On Mon, Jun 8, 2015 at 1:19 AM, Vilhelm von Ehrenheim
> <vonehrenheim@xxxxxxxxx> wrote:
>> One thing that is strange and that seems to be connected to the reshape
>> is this error, present in dmesg:
>>
>> [  360.625322] INFO: task md0_reshape:126 blocked for more than 120 seconds.
>> [  360.625351]       Not tainted 4.0.4-2-ARCH #1
>> [  360.625367] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  360.625394] md0_reshape     D ffff88040af57a58     0   126      2 0x00000000
>> [  360.625397]  ffff88040af57a58 ffff88040cf58000 ffff8800da535b20 00000001642a9888
>> [  360.625399]  ffff88040af57fd8 ffff8800da429000 ffff8800da429008 ffff8800da429208
>> [  360.625401]  0000000096400e00 ffff88040af57a78 ffffffff81576707 ffff8800da429000
>> [  360.625403] Call Trace:
>> [  360.625410]  [<ffffffff81576707>] schedule+0x37/0x90
>> [  360.625428]  [<ffffffffa0120de9>] get_active_stripe+0x5c9/0x760 [raid456]
>> [  360.625432]  [<ffffffff810b6c70>] ? wake_atomic_t_function+0x60/0x60
>> [  360.625436]  [<ffffffffa01246e0>] reshape_request+0x5b0/0x980 [raid456]
>> [  360.625439]  [<ffffffff81579053>] ? schedule_timeout+0x123/0x250
>> [  360.625443]  [<ffffffffa011743f>] sync_request+0x28f/0x400 [raid456]
>> [  360.625449]  [<ffffffffa00da486>] ? is_mddev_idle+0x136/0x170 [md_mod]
>> [  360.625454]  [<ffffffffa00de4ba>] md_do_sync+0x8ba/0xe70 [md_mod]
>> [  360.625457]  [<ffffffff81576002>] ? __schedule+0x362/0xa30
>> [  360.625462]  [<ffffffffa00d9e54>] md_thread+0x144/0x150 [md_mod]
>> [  360.625464]  [<ffffffff810b6c70>] ? wake_atomic_t_function+0x60/0x60
>> [  360.625468]  [<ffffffffa00d9d10>] ? md_start_sync+0xf0/0xf0 [md_mod]
>> [  360.625471]  [<ffffffff81093418>] kthread+0xd8/0xf0
>> [  360.625473]  [<ffffffff81093340>] ? kthread_worker_fn+0x170/0x170
>> [  360.625476]  [<ffffffff8157a398>] ret_from_fork+0x58/0x90
>> [  360.625478]  [<ffffffff81093340>] ? kthread_worker_fn+0x170/0x170
>>
>> Also, looking at CPU usage, md0_raid5 seems to be having problems as it
>> is stuck at 100% CPU on one core:
>>
>>   PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND
>>   125 root      20   0    0.0m   0.0m 100.0  0.0  35:57.44 R  `- md0_raid5
>>   126 root      20   0    0.0m   0.0m   0.0  0.0   0:00.06 D  `- md0_reshape
>>
>> Could this be why the reshape has stopped?
>>
>> Can I do something to get it going again, or is it possible to revert
>> to using 3 drives again without losing data? The data is not super
>> important, hence no backup solution, but it would mean a lot of lost
>> work.
>>
>> I'm thankful for any help I can get. Not sure what to do now.
>
> Hi Vilhelm,
>
> I ran into this exact situation several weeks ago. Fortunately Neil
> Brown was able to track it down; it turns out that the reshape
> operation can get stuck if it encounters bad blocks. See
> http://article.gmane.org/gmane.linux.raid/48673
>
> You can try applying the kernel patch from that message as a temporary
> hack to allow the reshape to complete. It worked fine for me, aside
> from a small amount of filesystem corruption that was fixable with
> fsck.
>
> -- David