On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@xxxxxxxxx> wrote:
>
> On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote:
> >
> >
> >
> > On 3/14/23 21:25, Marc Smith wrote:
> > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > > <guoqing.jiang@xxxxxxxxxxxxxxx> wrote:
> > >> Hi Donald,
> > >>
> > >> On 2/8/21 19:41, Donald Buczek wrote:
> > >>> Dear Guoqing,
> > >>>
> > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> > >>>>
> > >>>> On 2/8/21 12:38, Donald Buczek wrote:
> > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> > >>>>>> sync_thread, like this.
> > >>>>>>
> > >>>>>> /* resync has finished, collect result */
> > >>>>>> mddev_unlock(mddev);
> > >>>>>> md_unregister_thread(&mddev->sync_thread);
> > >>>>>> mddev_lock(mddev);
> > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> > >>>>> be a problem, if another user space operation takes the mutex?
> > >>>> I don't think other places can be blocked while hold mutex, otherwise
> > >>>> these places can cause potential deadlock. Please try above two lines
> > >>>> change. And perhaps others have better idea.
> > >>> Yes, this works. No deadlock after >11000 seconds,
> > >>>
> > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> > >> Great. I will send a formal patch with your reported-by and tested-by.
> > >>
> > >> Thanks,
> > >> Guoqing
> > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > > of the patches that supposedly resolve this were applied to the stable
> > > kernels, however, one was omitted due to a regression:
> > > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> > >
> > > I don't see any follow-up on the thread from June 8th 2022 asking for
> > > this patch to be dropped from all stable kernels since it caused a
> > > regression.
> > >
> > > The patch doesn't appear to be present in the current mainline kernel
> > > (6.3-rc2) either. So I assume this issue is still present there, or it
> > > was resolved differently and I just can't find the commit/patch.
> >
> > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> > sync_thread in action_store".
>
> Okay, let me try applying that patch... it does not appear to be
> present in my 5.4.229 kernel source. Thanks.

Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
sync_thread in action_store"' patch on top of vanilla 5.4.229 source
appears to fix the problem for me -- I can't reproduce the issue with
the script, and it's been running for >24 hours now. (Previously I was
able to induce the issue within a matter of minutes.)

>
> --Marc
>
>
> >
> > Thanks,
> > Guoqing