Hi Neil and Shaohua,

I'm trying to reproduce the problem "raid5_finish_reshape is stuck at
revalidate_disk", but I hit a new hang. I tested with 4.13.0-rc5 and the
latest mdadm. The steps are:

#!/bin/bash
num=0
while [ 1 ]
do
	echo "*************************$num"
	mdadm -Ss
	mdadm --create --run /dev/md0 --level 5 --metadata 1.2 \
		--raid-devices 5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 \
		--spare-devices 1 /dev/loop5 --chunk 512
	mdadm --wait /dev/md0
	mkfs -t ext4 /dev/md0
	mount -t ext4 /dev/md0 /mnt/md_test
	dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=100
	mdadm --add /dev/md0 /dev/loop6
	mdadm --grow --raid-devices 6 /dev/md0
	dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
	mdadm --wait /dev/md0
	((num++))
	umount /dev/md0
done

The calltrace messages are:

Sep 1 13:57:25 localhost kernel: INFO: task kworker/u8:4:21401 blocked for more than 120 seconds.
Sep 1 13:57:25 localhost kernel: Tainted: G OE 4.13.0-rc5 #2
Sep 1 13:57:25 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 1 13:57:25 localhost kernel: kworker/u8:4 D 0 21401 2 0x00000080
Sep 1 13:57:25 localhost kernel: Workqueue: writeback wb_workfn (flush-9:0)
Sep 1 13:57:25 localhost kernel: Call Trace:
Sep 1 13:57:25 localhost kernel:  __schedule+0x28d/0x890
Sep 1 13:57:25 localhost kernel:  schedule+0x36/0x80
Sep 1 13:57:25 localhost kernel:  md_write_start+0xf0/0x220 [md_mod]
Sep 1 13:57:25 localhost kernel:  ? remove_wait_queue+0x60/0x60
Sep 1 13:57:25 localhost kernel:  raid5_make_request+0x89/0x8b0 [raid456]
Sep 1 13:57:25 localhost kernel:  ? bio_split+0x5d/0x90
Sep 1 13:57:25 localhost kernel:  ? blk_queue_split+0xd2/0x630
Sep 1 13:57:25 localhost kernel:  ? remove_wait_queue+0x60/0x60
Sep 1 13:57:25 localhost kernel:  md_make_request+0xf5/0x260 [md_mod]
....

All the processes that send bios to md are stuck at md_write_start, in:
	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
		   !mddev->suspended);

Sep 1 13:57:26 localhost kernel: INFO: task md0_reshape:23605 blocked for more than 120 seconds.
Sep 1 13:57:26 localhost kernel: Tainted: G OE 4.13.0-rc5 #2
Sep 1 13:57:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 1 13:57:26 localhost kernel: md0_reshape D 0 23605 2 0x00000080
Sep 1 13:57:26 localhost kernel: Call Trace:
Sep 1 13:57:26 localhost kernel:  __schedule+0x28d/0x890
Sep 1 13:57:26 localhost kernel:  schedule+0x36/0x80
Sep 1 13:57:26 localhost kernel:  raid5_sync_request+0x2cf/0x370 [raid456]
Sep 1 13:57:26 localhost kernel:  ? remove_wait_queue+0x60/0x60
Sep 1 13:57:26 localhost kernel:  md_do_sync+0xafe/0xee0 [md_mod]
Sep 1 13:57:26 localhost kernel:  ? remove_wait_queue+0x60/0x60
Sep 1 13:57:26 localhost kernel:  md_thread+0x132/0x180 [md_mod]
Sep 1 13:57:26 localhost kernel:  kthread+0x109/0x140
Sep 1 13:57:26 localhost kernel:  ? find_pers+0x70/0x70 [md_mod]
Sep 1 13:57:26 localhost kernel:  ? kthread_park+0x60/0x60
Sep 1 13:57:26 localhost kernel:  ? do_syscall_64+0x67/0x150
Sep 1 13:57:26 localhost kernel:  ret_from_fork+0x25/0x30

It is stuck at:

	/* Allow raid5_quiesce to complete */
	wait_event(conf->wait_for_overlap, conf->quiesce != 2);

Sep 1 13:57:26 localhost kernel: INFO: task mdadm:23613 blocked for more than 120 seconds.
Sep 1 13:57:26 localhost kernel: Tainted: G OE 4.13.0-rc5 #2
Sep 1 13:57:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 1 13:57:26 localhost kernel: mdadm D 0 23613 1 0x00000080
Sep 1 13:57:26 localhost kernel: Call Trace:
Sep 1 13:57:26 localhost kernel:  __schedule+0x28d/0x890
Sep 1 13:57:26 localhost kernel:  schedule+0x36/0x80
Sep 1 13:57:26 localhost kernel:  raid5_quiesce+0x274/0x2b0 [raid456]
Sep 1 13:57:26 localhost kernel:  ? remove_wait_queue+0x60/0x60
Sep 1 13:57:26 localhost kernel:  suspend_lo_store+0x82/0xe0 [md_mod]
Sep 1 13:57:26 localhost kernel:  md_attr_store+0x80/0xc0 [md_mod]
Sep 1 13:57:26 localhost kernel:  sysfs_kf_write+0x3a/0x50
Sep 1 13:57:26 localhost kernel:  kernfs_fop_write+0xff/0x180
Sep 1 13:57:26 localhost kernel:  __vfs_write+0x37/0x170
Sep 1 13:57:26 localhost kernel:  ? selinux_file_permission+0xe5/0x120
Sep 1 13:57:26 localhost kernel:  ? security_file_permission+0x3b/0xc0
Sep 1 13:57:26 localhost kernel:  vfs_write+0xb2/0x1b0
Sep 1 13:57:26 localhost kernel:  ? syscall_trace_enter+0x1d0/0x2b0
Sep 1 13:57:26 localhost kernel:  SyS_write+0x55/0xc0
Sep 1 13:57:26 localhost kernel:  do_syscall_64+0x67/0x150

It is stuck at:

	conf->quiesce = 2;
	wait_event_cmd(conf->wait_for_quiescent,
		       atomic_read(&conf->active_stripes) == 0 &&
		       atomic_read(&conf->active_aligned_reads) == 0,
		       unlock_all_device_hash_locks_irq(conf),
		       lock_all_device_hash_locks_irq(conf));
	conf->quiesce = 1;

[root@dell-pr1700-02 ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] loop1[1] loop0[0]
      2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  reshape =  0.5% (2560/509952) finish=162985.0min speed=0K/sec

unused devices: <none>

I can reproduce this reliably, so I added some printk logging to check
MD_SB_CHANGE_PENDING in mddev->sb_flags, as well as mddev->suspended:

	mddev->suspended : 0
	mddev->sb_flags  : MD_SB_CHANGE_PENDING is set
	conf->quiesce    : 2

Then, after I echo active > /sys/block/md0/md/array_state, the reshape can
start and finish successfully.

I noticed there have been some recent fixes for raid5 hang problems. I'm
not sure whether this issue was introduced by those patches.

Best Regards
Xiao
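P.S. For anyone who wants to try the same workaround, here is a small
script version of what I did by hand. It is only a sketch: it assumes the
affected array is md0, and it just uses the standard md sysfs attributes
array_state and sync_action.

```shell
#!/bin/bash
# Sketch of the manual workaround: when the array is wedged with
# MD_SB_CHANGE_PENDING set, writing "active" to array_state lets the
# stuck reshape continue. Assumes the affected array is md0.
md=/sys/block/md0/md

if [ -w "$md/array_state" ]; then
	echo "array_state before: $(cat "$md/array_state")"
	echo active > "$md/array_state"
	echo "array_state after:  $(cat "$md/array_state")"
	echo "sync_action:        $(cat "$md/sync_action")"
else
	echo "no writable $md/array_state (not root, or md0 absent); nothing to do"
fi
```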