Hello List,

I filed two patches to fix two different md-cluster bugs. For easier
understanding, let's call them issue 1 (related to patch 1) and issue 2
(related to patch 2).

*** Test env ***

Node A & node B share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN size is 1GB.
The larger the disks are, the more likely the issues are to trigger
(more resync time makes the issues easier to hit).

*** Test script ***

Issues 1 & 2 can be triggered with the same test script:
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i}; do
    dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M count=20
done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
    --bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```
There is a workaround: when the (commented-out) --wait is added before the
second --grow, issue 1 disappears.

*** error behavior ***

issue 1: the test script finishes every command, but the array status is
wrong.
issue 2: the test script hangs when executing "mdadm --remove".

Array status of issue 1 (part of the output of "mdadm -D /dev/md0"):

<case 1> : normal test result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```
<case 2> : the "--fail" data still exists in the on-disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```
<case 3> : the "--remove" data still exists in the on-disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

Array status of issue 2 (hang info):
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?  -1 D  0   0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528) finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

echo t > /proc/sysrq-trigger

md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

*** analysis ***

<issue 1>

In a cluster env, every node can start a resync job even if the resync
command was not executed on it. e.g. with two nodes A & B: the user runs
"mdadm --grow" on A, but sometimes B starts the resync job instead of A.

Problem: the current update_raid_disks() only checks the local recovery
status, which is incomplete (see the sketch after the scenario below).

Issue scenario:
```
node A                                 node B
-------------------------------------  -------------------------------------
mdadm --grow -n 3 md0
+ raid1_reshape
  mddev->raid_disks: 2=>3
                                       start resync job, which blocks
                                       A's resync job
                                       mddev->raid_disks: 2=>3
mdadm md0 --fail sdg
+ update disk: array sb & bitmap sb
+ send METADATA_UPDATED
(resync job blocking)                  (B continues the "--grow -n 3"
                                        resync job)
                                       recv METADATA_UPDATED
                                       + read disk metadata
                                       + raid1_error
                                       + set MD_RECOVERY_INTR to break resync
                                       ... ...
                                       md_check_recovery
                                       + remove_and_add_spares returns 1
                                       + set MD_RECOVERY_RECOVER, later
                                         restart resync
mdadm md0 --remove sdg
+ md_cluster_ops->remove_disk
| + send REMOVE
+ md_kick_rdev_from_array
+ update disk: array sb & bitmap sb
(resync job blocking)                  (B continues the "--grow -n 3"
                                        resync job)
                                       recv REMOVE
                                       + process_remove_disk doesn't set
                                         mddev->sb_flags, so it doesn't
                                         update the disk sb & bitmap sb
                                       ... ...
                                       md_check_recovery
                                       + md_kick_rdev_from_array
mdadm --grow -n 2 md0
+ raid1_reshape
| mddev->raid_disks: 3=>2
+ send METADATA_UPDATED
                                       (B continues the "--grow -n 3"
                                        resync job)
                                       recv METADATA_UPDATED
                                       + check_sb_changes
                                         update_raid_disks returns -EBUSY,
                                         so the update of
                                         mddev->raid_disks: 3=>2 fails
                                       (B never successfully updates
                                        mddev->raid_disks: 3=>2)
                                       when B finishes the "--grow -n 3"
                                       resync job:
                                       + use mddev->raid_disks: 3 to update
                                         array sb & bitmap sb
                                       + send METADATA_UPDATED
recv METADATA_UPDATED
+ read wrong raid_disks to update
  kernel data
```
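To make the patch 1 idea concrete, below is a minimal sketch, not the
literal diff: the busy check in update_raid_disks() also has to cover a
resync that is running on a remote node. md already tracks that with the
MD_RESYNCING_REMOTE bit in mddev->recovery (set by md-cluster when a
RESYNCING message arrives); whether patch 1 uses exactly this condition is
an assumption of this sketch.
```c
/*
 * Minimal sketch of the patch 1 direction (drivers/md/md.c context), not
 * the literal diff: treat a remote resync as "busy" too, instead of only
 * checking the local sync thread.
 */
static int update_raid_disks(struct mddev *mddev, int raid_disks)
{
	/* ... argument validation as in md.c ... */

	if (mddev->sync_thread ||
	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
	    mddev->reshape_position != MaxSector)
		return -EBUSY;	/* a local or remote resync is in flight */

	/* ... set delta_disks and call pers->check_reshape() as before ... */
	return 0;
}
```
With such a check, the reshape request fails with -EBUSY while the remote
node is still resyncing, instead of leaving the two nodes with inconsistent
raid_disks.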
<issue 2>

First, there is a similar deadlock, which was handled by commit
0ba959774e93911caff596de6391f085fb640ac4. Let me explain commit
0ba959774e first.
```
<origin scenario>
nodeA                                  nodeB
--------------------                   --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                       b.
                                       md_do_sync
                                        resync_info_update
                                         send RESYNCING
                                         + set MD_CLUSTER_SEND_LOCK
                                         + wait for holding token_lockres:EX

                                       c.
                                       update sb
                                       + held reconfig_mutex
                                       + metadata_update_start
                                         + wait_event(MD_CLUSTER_SEND_LOCK)
                                           //blocking from <b>

                                       d.
                                       recv_daemon //METADATA_UPDATED from A
                                        process_metadata_update
                                        + mddev_trylock(mddev) returns false
                                          //blocking from <c>

<after introducing "MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD">
nodeA                                  nodeB
--------------------                   --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                       b.
                                       md_do_sync
                                        resync_info_update
                                         send RESYNCING
                                         + set MD_CLUSTER_SEND_LOCK
                                         + wait for holding token_lockres:EX

                                       c.
                                       update sb
                                       + held reconfig_mutex
                                       + metadata_update_start
                                         + set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
                                           //for breaking <d>
                                         + wait_event(MD_CLUSTER_SEND_LOCK)

                                       d.
                                       recv_daemon //METADATA_UPDATED from A
                                        process_metadata_update
                                        + (mddev_trylock(mddev) ||
                                           MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                          //this doesn't block & breaks
                                            the deadlock
```
Issue 2 is very similar to the 0ba959774e case, except for step <c>:
```
nodeA                                  nodeB
--------------------                   --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                       b.
                                       md_do_sync
                                        resync_info_update
                                         send RESYNCING
                                         + set MD_CLUSTER_SEND_LOCK
                                         + wait for holding token_lockres:EX

                                       c.
                                       mdadm /dev/md0 --remove /dev/sdg
                                       + held reconfig_mutex
                                       + send REMOVE
                                         + wait_event(MD_CLUSTER_SEND_LOCK)

                                       d.
                                       recv_daemon //METADATA_UPDATED from A
                                        process_metadata_update
                                        + (mddev_trylock(mddev) ||
                                           MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                          //this time, both return false
                                            forever
```
Commit 0ba959774e9391 uses MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD to break the
deadlock, but in issue 2 it doesn't help.

md_cluster_ops->remove_disk is called from:
- state_store(), which doesn't hold reconfig_mutex
- hot_remove_disk(), which must hold reconfig_mutex

There are two methods to fix this:
1. Like commit 0ba959774e: hold reconfig_mutex in state_store(), and set
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in remove_disk().
2. Like patch 2: change wait_event to wait_event_timeout in lock_comm() &
   process_metadata_update().

There are some reasons why I prefer method 2:
- I am not very familiar with all the scenarios in state_store(), so I am
  not sure whether holding reconfig_mutex there could introduce a new
  bug/issue.
- It looks like any sendmsg case could trigger issue 2. Currently we have
  found two cases (there may be others):
  - updating the sb (see commit 0ba959774)
  - mdadm --remove (issue 2)
  So we should break the deadlock in the key code path
  (wait_event => wait_event_timeout), as sketched below.
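To make the method 2 idea concrete, here is a minimal sketch of the
wait_event => wait_event_timeout change in lock_comm(); the same treatment
applies to the wait_event in process_metadata_update(). This is not the
literal patch 2 diff: the 5 second timeout, the error code, and the
simplified signatures of lock_comm()/lock_token() are assumptions of this
sketch.
```c
/*
 * Minimal sketch of the patch 2 direction (drivers/md/md-cluster.c
 * context), not the literal diff: bound the wait so the caller can bail
 * out with an error instead of deadlocking with recv_daemon.
 */
static int lock_comm(struct md_cluster_info *cinfo)
{
	/*
	 * An unconditional wait_event() here can sleep forever: recv_daemon
	 * may be stuck in process_metadata_update() waiting for
	 * reconfig_mutex, which our caller already holds (step <c> above).
	 */
	if (!wait_event_timeout(cinfo->wait,
				!test_and_set_bit(MD_CLUSTER_SEND_LOCK,
						  &cinfo->state),
				msecs_to_jiffies(5000))) {
		pr_err("md-cluster: timed out while taking the comm lock\n");
		return -ETIMEDOUT;
	}

	return lock_token(cinfo);
}
```
The intent is that a command racing with a remote metadata update fails
with an error instead of leaving mdadm stuck in D state.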
-------
v1:
- create patch
-------

Zhao Heming (2):
  md/cluster: reshape should return error when remote node is doing resync
    job
  md/cluster: fix deadlock when doing reshape job

 drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
 drivers/md/md.c         |  8 ++++++--
 2 files changed, 33 insertions(+), 17 deletions(-)

--
2.27.0