On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <msmith626@xxxxxxxxx> wrote:
>
> On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@xxxxxxxxx> wrote:
> >
> > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang
> > <guoqing.jiang@linux.dev> wrote:
> > >
> > >
> > >
> > > On 3/14/23 21:25, Marc Smith wrote:
> > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> > > > <guoqing.jiang@xxxxxxxxxxxxxxx> wrote:
> > > >> Hi Donald,
> > > >>
> > > >> On 2/8/21 19:41, Donald Buczek wrote:
> > > >>> Dear Guoqing,
> > > >>>
> > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
> > > >>>>
> > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
> > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
> > > >>>>>> sync_thread, like this.
> > > >>>>>>
> > > >>>>>> /* resync has finished, collect result */
> > > >>>>>> mddev_unlock(mddev);
> > > >>>>>> md_unregister_thread(&mddev->sync_thread);
> > > >>>>>> mddev_lock(mddev);
> > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
> > > >>>>> be a problem, if another user space operation takes the mutex?
> > > >>>> I don't think other places can be blocked while hold mutex, otherwise
> > > >>>> these places can cause potential deadlock. Please try above two lines
> > > >>>> change. And perhaps others have better idea.
> > > >>> Yes, this works. No deadlock after >11000 seconds,
> > > >>>
> > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
> > > >> Great. I will send a formal patch with your reported-by and tested-by.
> > > >>
> > > >> Thanks,
> > > >> Guoqing
> > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> > > > of the patches that supposedly resolve this were applied to the stable
> > > > kernels, however, one was omitted due to a regression:
> > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
> > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> > > >
> > > > I don't see any follow-up on the thread from June 8th 2022 asking for
> > > > this patch to be dropped from all stable kernels since it caused a
> > > > regression.
> > > >
> > > > The patch doesn't appear to be present in the current mainline kernel
> > > > (6.3-rc2) either. So I assume this issue is still present there, or it
> > > > was resolved differently and I just can't find the commit/patch.
> > >
> > > It should be fixed by commit 9dfbdafda3b3 "md: unlock mddev before reap
> > > sync_thread in action_store".
> >
> > Okay, let me try applying that patch... it does not appear to be
> > present in my 5.4.229 kernel source. Thanks.
>
> Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
> sync_thread in action_store"' patch on top of vanilla 5.4.229 source
> appears to fix the problem for me -- I can't reproduce the issue with
> the script, and it's been running for >24 hours now. (Previously I was
> able to induce the issue within a matter of minutes.)
Hi Marc,
Could you please run your reproducer on the md-tmp branch?
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp
This contains a different version of the fix by Yu Kuai.
Thanks,
Song