Re: [PATCH 1/1] md/raid10: avoid deadlock on recovery.

Song Liu <liu.song.a23@xxxxxxxxx> · Tue, 21 Jul 2020 23:18:19 -0700

On Tue, Jul 21, 2020 at 7:26 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
>
>
> > On Mar 3, 2020, at 1:14 PM, Vitaly Mayatskikh <vmayatskikh@xxxxxxxxxxxxxxxx> wrote:
> >
> > When a disk fails and the array has a spare drive, the resync thread
> > kicks in and starts to refill the spare. However, it may get blocked by
> > a retry thread that resubmits failed IO to a mirror, and the retry thread
> > can itself get blocked on a barrier raised by the resync thread.
> >
> > Signed-off-by: Vitaly Mayatskikh <vmayatskikh@xxxxxxxxxxxxxxxx>
> > ---
> > drivers/md/raid10.c | 14 +++++++++++---
> > 1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > index ec136e4..f1a8e26 100644
> > --- a/drivers/md/raid10.c
> > +++ b/drivers/md/raid10.c
> > @@ -980,6 +980,7 @@ static void wait_barrier(struct r10conf *conf)
> > {
> >       spin_lock_irq(&conf->resync_lock);
> >       if (conf->barrier) {
> > +             struct bio_list *bio_list = current->bio_list;
> >               conf->nr_waiting++;
> >               /* Wait for the barrier to drop.
> >                * However if there are already pending
> > @@ -994,9 +995,16 @@ static void wait_barrier(struct r10conf *conf)
> >               wait_event_lock_irq(conf->wait_barrier,
> >                                   !conf->barrier ||
> >                                   (atomic_read(&conf->nr_pending) &&
> > -                                  current->bio_list &&
> > -                                  (!bio_list_empty(&current->bio_list[0]) ||
> > -                                   !bio_list_empty(&current->bio_list[1]))),
> > +                                  bio_list &&
> > +                                  (!bio_list_empty(&bio_list[0]) ||
> > +                                   !bio_list_empty(&bio_list[1]))) ||
> > +                                  /* move on if recovery thread is
> > +                                   * blocked by us
> > +                                   */
> > +                                  (conf->mddev->thread->tsk == current &&
> > +                                   test_bit(MD_RECOVERY_RUNNING,
> > +                                            &conf->mddev->recovery) &&
> > +                                   conf->nr_queued > 0),
> >                                   conf->resync_lock);
> >               conf->nr_waiting--;
> >               if (!conf->nr_waiting)
> > --
> > 1.8.3.1
> >
>
> Song, have you had a chance to look at this patch?
> We would like to have it pulled into the kernel.

I am sorry I missed this one. This looks good to me.

Nigel, would you like to add your Reviewed-by, or Acked-by, or Tested-by tag?

Thanks,
Song
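
For anyone reading the thread later, the wait condition as it stands after
the patch can be modelled in userspace roughly as below. This is only a
sketch for reasoning about the three exit cases, not kernel code: struct
barrier_state and may_pass_barrier() are hypothetical names, and each field
is a plain stand-in for the expression noted in its comment.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the state that the
 * wait_event_lock_irq() condition in wait_barrier() reads.
 * This is not the kernel's struct r10conf. */
struct barrier_state {
	int  barrier;          /* conf->barrier */
	int  nr_pending;       /* atomic_read(&conf->nr_pending) */
	int  nr_queued;        /* conf->nr_queued */
	bool has_queued_bios;  /* current->bio_list set and not empty */
	bool is_md_thread;     /* conf->mddev->thread->tsk == current */
	bool recovery_running; /* MD_RECOVERY_RUNNING set in mddev->recovery */
};

/* Returns true when the waiter would stop sleeping on the barrier. */
static bool may_pass_barrier(const struct barrier_state *s)
{
	/* Case 1: the barrier has dropped. */
	if (!s->barrier)
		return true;

	/* Case 2 (already present before the patch): there is pending IO
	 * and this task still holds bios queued on current->bio_list. */
	if (s->nr_pending && s->has_queued_bios)
		return true;

	/* Case 3 (added by the patch): the caller is the md thread,
	 * recovery is running, and there are queued retries, so waiting
	 * here would block the recovery thread that raised the barrier. */
	if (s->is_md_thread && s->recovery_running && s->nr_queued > 0)
		return true;

	return false;
}
```

With barrier raised, is_md_thread and recovery_running set, and nr_queued
at 1, the model returns true, which is the case the patch adds. With
nr_queued at 0 and nothing else set, it returns false and the caller keeps
waiting.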