Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared

On Mon, Sep 04 2017, Xiao Ni wrote:

> On 09/04/2017 10:45 AM, Xiao Ni wrote:
>>
>>
>> On 09/04/2017 10:16 AM, NeilBrown wrote:
>>> On Sat, Sep 02 2017, Xiao Ni wrote:
>>>
>>>> Hi Neil and Shaohua
>>>>
>>>> I'm trying to reproduce the problem "raid5_finish_reshape is stuck
>>>> at revalidate_disk", but I have hit a new hang. I tried with
>>>> 4.13.0-rc5 and the latest mdadm.
>>>> The steps are:
>>>>
>>>> #!/bin/bash
>>>> num=0
>>>> while [ 1 ]
>>>> do
>>>>     echo "*************************$num"
>>>>     mdadm -Ss
>>>>     mdadm --create --run /dev/md0 --level 5 --metadata 1.2 \
>>>>         --raid-devices 5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 \
>>>>         /dev/loop4 --spare-devices 1 /dev/loop5 --chunk 512
>>>>     mdadm --wait /dev/md0
>>>>     mkfs -t ext4 /dev/md0
>>>>     mount -t ext4 /dev/md0 /mnt/md_test
>>>>     dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=100
>>>>     mdadm --add /dev/md0 /dev/loop6
>>>>     mdadm --grow --raid-devices 6 /dev/md0
>>>>     dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
>>>>     mdadm --wait /dev/md0
>>>>     ((num++))
>>>>     umount /dev/md0
>>>> done
>>>>
>>>> The calltrace messages are:
>>>> Sep  1 13:57:25 localhost kernel: INFO: task kworker/u8:4:21401 
>>>> blocked for more than 120 seconds.
>>>> Sep  1 13:57:25 localhost kernel:      Tainted: G OE   4.13.0-rc5 #2
>>>> Sep  1 13:57:25 localhost kernel: "echo 0 > 
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep  1 13:57:25 localhost kernel: kworker/u8:4    D    0 21401      
>>>> 2 0x00000080
>>>> Sep  1 13:57:25 localhost kernel: Workqueue: writeback wb_workfn 
>>>> (flush-9:0)
>>>> Sep  1 13:57:25 localhost kernel: Call Trace:
>>>> Sep  1 13:57:25 localhost kernel: __schedule+0x28d/0x890
>>>> Sep  1 13:57:25 localhost kernel: schedule+0x36/0x80
>>>> Sep  1 13:57:25 localhost kernel: md_write_start+0xf0/0x220 [md_mod]
>>>> Sep  1 13:57:25 localhost kernel: ? remove_wait_queue+0x60/0x60
>>>> Sep  1 13:57:25 localhost kernel: raid5_make_request+0x89/0x8b0 
>>>> [raid456]
>>>> Sep  1 13:57:25 localhost kernel: ? bio_split+0x5d/0x90
>>>> Sep  1 13:57:25 localhost kernel: ? blk_queue_split+0xd2/0x630
>>>> Sep  1 13:57:25 localhost kernel: ? remove_wait_queue+0x60/0x60
>>>> Sep  1 13:57:25 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
>>>> ....
>>>>
>>>> All the processes that send bios to md are stuck in md_write_start, at:
>>>>          wait_event(mddev->sb_wait,
>>>>                     !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
>>>>                     !mddev->suspended);
>>>>
>>>>
>>>> Sep  1 13:57:26 localhost kernel: INFO: task md0_reshape:23605 
>>>> blocked for more than 120 seconds.
>>>> Sep  1 13:57:26 localhost kernel:      Tainted: G OE   4.13.0-rc5 #2
>>>> Sep  1 13:57:26 localhost kernel: "echo 0 > 
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep  1 13:57:26 localhost kernel: md0_reshape     D    0 23605      
>>>> 2 0x00000080
>>>> Sep  1 13:57:26 localhost kernel: Call Trace:
>>>> Sep  1 13:57:26 localhost kernel: __schedule+0x28d/0x890
>>>> Sep  1 13:57:26 localhost kernel: schedule+0x36/0x80
>>>> Sep  1 13:57:26 localhost kernel: raid5_sync_request+0x2cf/0x370 
>>>> [raid456]
>>>> Sep  1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
>>>> Sep  1 13:57:26 localhost kernel: md_do_sync+0xafe/0xee0 [md_mod]
>>>> Sep  1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
>>>> Sep  1 13:57:26 localhost kernel: md_thread+0x132/0x180 [md_mod]
>>>> Sep  1 13:57:26 localhost kernel: kthread+0x109/0x140
>>>> Sep  1 13:57:26 localhost kernel: ? find_pers+0x70/0x70 [md_mod]
>>>> Sep  1 13:57:26 localhost kernel: ? kthread_park+0x60/0x60
>>>> Sep  1 13:57:26 localhost kernel: ? do_syscall_64+0x67/0x150
>>>> Sep  1 13:57:26 localhost kernel: ret_from_fork+0x25/0x30
>>>>
>>>> It is stuck at:
>>>>          /* Allow raid5_quiesce to complete */
>>>>          wait_event(conf->wait_for_overlap, conf->quiesce != 2);
>>>>
>>>> Sep  1 13:57:26 localhost kernel: INFO: task mdadm:23613 blocked for 
>>>> more than 120 seconds.
>>>> Sep  1 13:57:26 localhost kernel:      Tainted: G OE   4.13.0-rc5 #2
>>>> Sep  1 13:57:26 localhost kernel: "echo 0 > 
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep  1 13:57:26 localhost kernel: mdadm           D    0 23613      
>>>> 1 0x00000080
>>>> Sep  1 13:57:26 localhost kernel: Call Trace:
>>>> Sep  1 13:57:26 localhost kernel: __schedule+0x28d/0x890
>>>> Sep  1 13:57:26 localhost kernel: schedule+0x36/0x80
>>>> Sep  1 13:57:26 localhost kernel: raid5_quiesce+0x274/0x2b0 [raid456]
>>>> Sep  1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
>>>> Sep  1 13:57:26 localhost kernel: suspend_lo_store+0x82/0xe0 [md_mod]
>>>> Sep  1 13:57:26 localhost kernel: md_attr_store+0x80/0xc0 [md_mod]
>>>> Sep  1 13:57:26 localhost kernel: sysfs_kf_write+0x3a/0x50
>>>> Sep  1 13:57:26 localhost kernel: kernfs_fop_write+0xff/0x180
>>>> Sep  1 13:57:26 localhost kernel: __vfs_write+0x37/0x170
>>>> Sep  1 13:57:26 localhost kernel: ? selinux_file_permission+0xe5/0x120
>>>> Sep  1 13:57:26 localhost kernel: ? security_file_permission+0x3b/0xc0
>>>> Sep  1 13:57:26 localhost kernel: vfs_write+0xb2/0x1b0
>>>> Sep  1 13:57:26 localhost kernel: ? syscall_trace_enter+0x1d0/0x2b0
>>>> Sep  1 13:57:26 localhost kernel: SyS_write+0x55/0xc0
>>>> Sep  1 13:57:26 localhost kernel: do_syscall_64+0x67/0x150
>>>>
>>>> It's stuck at:
>>>> conf->quiesce = 2;
>>>> wait_event_cmd(conf->wait_for_quiescent,
>>>>     atomic_read(&conf->active_stripes) == 0 &&
>>>>     atomic_read(&conf->active_aligned_reads) == 0,
>>>>     unlock_all_device_hash_locks_irq(conf),
>>>>     lock_all_device_hash_locks_irq(conf));
>>>> conf->quiesce = 1;
>>>>
>>>> [root@dell-pr1700-02 ~]# cat /proc/mdstat
>>>> Personalities : [raid6] [raid5] [raid4]
>>>> md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] 
>>>> loop1[1] loop0[0]
>>>>        2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 
>>>> [6/6] [UUUUUU]
>>>>        [>....................]  reshape =  0.5% (2560/509952) 
>>>> finish=162985.0min speed=0K/sec
>>>>        unused devices: <none>
>>>>
>>>>
>>>> I can reproduce this, so I added some printk logging to check
>>>> MD_SB_CHANGE_PENDING in mddev->sb_flags and mddev->suspended:
>>>> mddev->suspend : 0
>>>> mddev->flags   : SP CHANGE PENDING is set
>>>> conf->quiesce  : 2
>>>>
>>>> Then I echo "active" > /sys/block/md0/md/array_state, and the reshape
>>>> can start and finish successfully.
>>>>
>>>> I notice there are some fixes for raid5 hang problems. I'm not sure
>>>> whether this one was introduced by those patches.
>>>>
>>>> Best Regards
>>>> Xiao
>>> Thanks for the detailed report.
>>> All the processes that you have mentioned are waiting for
>>> raid5_quiesce() to complete.
>>> raid5_quiesce() must be waiting for either active_stripes or
>>> active_aligned_reads to reach zero.  Most likely active_stripes.
>>>
>>> That suggests that either there is a stripe that is active and isn't
>>> being handled for some reason, or the accounting of active stripes has
>>> problems.
>>>
>>> The fact that setting the array to 'active' removes the hang is
>>> interesting.  That would cause the threads blocked in md_write_start to
>>> continue, but I cannot see how that would affect active_stripes as
>>> threads waiting for md_write_start() never hold a stripe.  Doing that
>>> could cause a stripe to be used, then released, which might trigger a
>>> wakeup that had been lost.
>> At first I suspected there were some active stripes that hadn't been
>> handled yet. But the raid5d process doesn't appear in the call traces;
>> it wasn't stuck.
>>>
>>> It would be useful to see if ->active_stripes is 0 when the deadlock
>>> happens.
>>> One way to do that is to add a printk into raid5_status().  Then
>>> whenever you "cat /proc/mdstat" the printk will trigger and tell you the
>>> value of ->active_stripes.
>>>
>>> I haven't managed to reproduce this yet, or I would track it down
>>> myself.
>>
>> Sure, the test is running again. I'll send the information later.
>
> Hi Neil
>
> I have reproduced this.
>
> [root@dell-pr1700-02 ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] 
> loop1[1] loop0[0]
>        2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] 
> [UUUUUU]
>        [>....................]  reshape =  0.5% (2560/509952) 
> finish=2498.2min speed=3K/sec
>
> unused devices: <none>
> [root@dell-pr1700-02 ~]# dmesg
> [ 3072.826251] suspend : 0
> [ 3072.828683] SP CHANGE PENDING is set
> [ 3072.832243] conf quiesce 2
> [ 3072.834932] active stripes : 251
> [ 3072.838139] active aligned reads : 0
>
> [root@dell-pr1700-02 ~]# ps auxf | grep md
> ...
> root      1682  0.0  0.0      0     0 ?        S<   23:02   0:00 \_ [md]
> root      3103 97.2  0.0      0     0 ?        R    23:34  14:06  \_ 
> [md0_raid5]
> root      3123  0.0  0.0      0     0 ?        D    23:34   0:00  \_ 
> [jbd2/md0-8]
> root      3134  0.0  0.0      0     0 ?        D    23:34   0:00  \_ 
> [md0_reshape]
> ....
> root      3142  0.5  0.0 108996  2728 pts/1    D+   23:34   0:04 
> |           \_ dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
> root      3223  0.0  0.0 112672  2300 pts/0    S+   23:48 0:00          
> \_ grep --color=auto md
> root      3141  0.0  0.0  10132  5288 ?        DLs  23:34   0:00 
> /usr/sbin/mdadm --grow --continue /dev/md0
>
> It's strange: raid5d is running, but it doesn't handle the active stripes.
> Do you need more information?

It would help a lot to know what md0_raid5 is doing.  It appears to be
spinning, and that is presumably the cause of the problem.

You could try

  echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control

or see if /proc/3103/stack contains anything useful.
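
For example, sampling it a few times to see whether it is moving or stuck
in one place (3103 being the md0_raid5 pid from your ps output):

  for i in 1 2 3 4 5; do cat /proc/3103/stack; echo ----; sleep 1; done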

Maybe perf could be used to see where it is spending its time.
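
Untested, but something along these lines should show where it is
spinning (again using pid 3103 from the ps listing above):

  perf record -g -p 3103 -- sleep 30
  perf report --stdio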

NeilBrown


