On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang <guoqing.jiang@xxxxxxxxxxxxxxx> wrote:
>
> Hi Donald,
>
> On 2/8/21 19:41, Donald Buczek wrote:
> > Dear Guoqing,
> >
> > On 08.02.21 15:53, Guoqing Jiang wrote:
> >>
> >>
> >> On 2/8/21 12:38, Donald Buczek wrote:
> >>>> 5. maybe don't hold reconfig_mutex when try to unregister
> >>>> sync_thread, like this.
> >>>>
> >>>> /* resync has finished, collect result */
> >>>> mddev_unlock(mddev);
> >>>> md_unregister_thread(&mddev->sync_thread);
> >>>> mddev_lock(mddev);
> >>>
> >>> As above: While we wait for the sync thread to terminate, wouldn't it
> >>> be a problem, if another user space operation takes the mutex?
> >>
> >> I don't think other places can be blocked while hold mutex, otherwise
> >> these places can cause potential deadlock. Please try above two lines
> >> change. And perhaps others have better idea.
> >
> > Yes, this works. No deadlock after >11000 seconds,
> >
> > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>
> Great. I will send a formal patch with your reported-by and tested-by.
>
> Thanks,
> Guoqing

I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 of
the patches that supposedly resolve this were applied to the stable
kernels, however, one was omitted due to a regression:

  md: don't unregister sync_thread with reconfig_mutex held
  (upstream commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)

I don't see any follow-up on the thread from June 8th 2022 asking for
this patch to be dropped from all stable kernels since it caused a
regression.

The patch doesn't appear to be present in the current mainline kernel
(6.3-rc2) either. So I assume this issue is still present there, or it
was resolved differently and I just can't find the commit/patch.

I can induce the issue by using Donald's script above which will
eventually result in hangs:

...
[147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
[147948.504624] Tainted: P OE 5.4.229-esos.prod #1
[147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[147948.504625] md_test_2.sh D 0 68033 1 0x00000004
[147948.504627] Call Trace:
[147948.504634] __schedule+0x4ab/0x4f3
[147948.504637] ? usleep_range+0x7a/0x7a
[147948.504638] schedule+0x67/0x81
[147948.504639] schedule_timeout+0x2c/0xe5
[147948.504643] ? do_raw_spin_lock+0x2b/0x52
[147948.504644] __wait_for_common+0xc4/0x13a
[147948.504647] ? wake_up_q+0x40/0x40
[147948.504649] kthread_stop+0x9a/0x117
[147948.504653] md_unregister_thread+0x43/0x4d
[147948.504655] md_reap_sync_thread+0x1c/0x1d5
[147948.504657] action_store+0xc9/0x284
[147948.504658] md_attr_store+0x9f/0xb8
[147948.504661] kernfs_fop_write+0x10a/0x14c
[147948.504664] vfs_write+0xa0/0xdd
[147948.504666] ksys_write+0x71/0xba
[147948.504668] do_syscall_64+0x52/0x60
[147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
...
[147948.504748] INFO: task md120_resync:135315 blocked for more than 122 seconds.
[147948.504749] Tainted: P OE 5.4.229-esos.prod #1
[147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[147948.504749] md120_resync D 0 135315 2 0x80004000
[147948.504750] Call Trace:
[147948.504752] __schedule+0x4ab/0x4f3
[147948.504754] ? printk+0x53/0x6a
[147948.504755] schedule+0x67/0x81
[147948.504756] md_do_sync+0xae7/0xdd9
[147948.504758] ? remove_wait_queue+0x41/0x41
[147948.504759] md_thread+0x128/0x151
[147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d
[147948.504762] ? md_start_sync+0xdc/0xdc
[147948.504763] kthread+0xe4/0xe9
[147948.504764] ? kthread_flush_worker+0x70/0x70
[147948.504765] ret_from_fork+0x35/0x40
...

This happens on 'raid6' MD RAID arrays that initially have
sync_action==resync.

Any guidance would be greatly appreciated.

--Marc