Re: [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds

On Mon, May 20, 2024 at 3:27 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> > On 2024/05/20 10:55, Yu Kuai wrote:
> > Hi, Changhui
> >
> >> On 2024/05/20 8:39, Changhui Zhong wrote:
> >> [czhong@vm linux-block]$ git bisect bad
> >> 060406c61c7cb4bbd82a02d179decca9c9bb3443 is the first bad commit
> >> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443
> >> Author: Yu Kuai <yukuai3@xxxxxxxxxx>
> >> Date:   Thu May 9 20:38:25 2024 +0800
> >>
> >>      block: add plug while submitting IO
> >>
> >>      So that if caller didn't use plug, for example,
> >>      __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(),
> >>      block layer can still benefit from caching nsec time in the plug.
> >>
> >>      Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
> >>
> >>      Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@xxxxxxxxxxxxxxx
> >>
> >>      Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> >>
> >>   block/blk-core.c | 6 ++++++
> >>   1 file changed, 6 insertions(+)
> >
> > Thanks for the test!
> >
> > I was surprised to see this blamed commit, and after taking a look at
> > the raid1 barrier code, I found that there are some known problems that
> > were fixed in raid10 but are still unfixed in raid1. So I wonder whether
> > this patch just makes the existing problem easier to reproduce.
> >
> > I'll start cooking patches to sync the raid10 fixes to raid1. Meanwhile,
> > can you change your script to test raid10 as well (see the sketch below)?
> > If raid10 is fine, I'll give you these patches later to test raid1.
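> >
> > I don't know the details of your script, so this is only a rough idea
> > of what I mean, with placeholder device names and sizes:
> >
> >     # create a raid10 array instead of raid1; the initial resync starts
> >     mdadm --create /dev/md0 --run --level=10 --raid-devices=4 \
> >           /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
> >     # issue direct IO against the array while the resync is running
> >     dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct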
>
> Hi,
>
> Sorry to ask, but since I can't reproduce the problem myself, and code
> review shows multiple potential problems, can you also reproduce the
> problem with the following debug patch? It just adds some debug info,
> with no functional changes, so that I can make sure of the details of
> the problem.
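>
> When it reproduces, please also include the debug lines and the array
> state, roughly like this (just a sketch, adjust to however you collect
> logs):
>
>     dmesg | grep -E 'waiting_barrier|waiting_pending'
>     cat /proc/mdstat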
>

Hi, Kuai

Yeah, I can test your patch, but I hit a problem when applying it.
Please help check the error below, and I will test again after you fix it.

```
patching file drivers/md/raid1.c
patch: **** malformed patch at line 42: idx, nr, RESYNC_DEPTH);
```

Thanks,
Changhui

> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 113135e7b5f2..b35b847a9e8b 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -936,6 +936,45 @@ static void flush_pending_writes(struct r1conf *conf)
>                  spin_unlock_irq(&conf->device_lock);
>   }
>
> +static bool waiting_barrier(struct r1conf *conf, int idx)
> +{
> +       int nr = atomic_read(&conf->nr_waiting[idx]);
> +
> +       if (nr) {
> +               printk("%s: idx %d nr_waiting %d\n", __func__, idx, nr);
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +static bool waiting_pending(struct r1conf *conf, int idx)
> +{
> +       int nr;
> +
> +       if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery))
> +               return false;
> +
> +       if (conf->array_frozen) {
> +               printk("%s: array is frozen\n", __func__);
> +               return true;
> +       }
> +
> +       nr = atomic_read(&conf->nr_pending[idx]);
> +       if (nr) {
> +               printk("%s: idx %d nr_pending %d\n", __func__, idx, nr);
> +               return true;
> +       }
> +
> +       nr = atomic_read(&conf->barrier[idx]);
> +       if (nr >= RESYNC_DEPTH) {
> +               printk("%s: idx %d barrier %d exceeds %d\n", __func__, idx, nr, RESYNC_DEPTH);
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
>   /* Barriers....
>    * Sometimes we need to suspend IO while we do something else,
>    * either some resync/recovery, or reconfigure the array.
> @@ -967,8 +1006,7 @@ static int raise_barrier(struct r1conf *conf, sector_t sector_nr)
>          spin_lock_irq(&conf->resync_lock);
>
>          /* Wait until no block IO is waiting */
> -       wait_event_lock_irq(conf->wait_barrier,
> -                           !atomic_read(&conf->nr_waiting[idx]),
> +       wait_event_lock_irq(conf->wait_barrier, !waiting_barrier(conf, idx),
>                              conf->resync_lock);
>
>          /* block any new IO from starting */
> @@ -990,11 +1028,7 @@ static int raise_barrier(struct r1conf *conf, sector_t sector_nr)
>           * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning reaches
>           *    max resync count which allowed on current I/O barrier bucket.
>           */
> -       wait_event_lock_irq(conf->wait_barrier,
> -                           (!conf->array_frozen &&
> -                            !atomic_read(&conf->nr_pending[idx]) &&
> -                            atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH) ||
> -                               test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery),
> +       wait_event_lock_irq(conf->wait_barrier, !waiting_pending(conf, idx),
>                              conf->resync_lock);
>
>          if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
>
> Thanks,
> Kuai
>
> >
> > Thanks,
> > Kuai
> >
> > .
> >
>
