On Mon, Dec 13, 2021 at 2:43 PM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
>
>
> On 12/12/21 10:56 PM, Song Liu wrote:
> > On Fri, Dec 10, 2021 at 10:26 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
> >>
> >> On 12/9/21 7:16 PM, Song Liu wrote:
> >>> On Wed, Nov 10, 2021 at 10:15 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
> >>>> Returns EAGAIN in case the raid456 driver would block
> >>>> waiting for situations like:
> >>>>
> >>>> - Reshape operation,
> >>>> - Discard operation.
> >>>>
> >>>> Signed-off-by: Vishal Verma <vverma@xxxxxxxxxxxxxxxx>
> >>>> ---
> >>>>  drivers/md/raid5.c | 14 ++++++++++++++
> >>>>  1 file changed, 14 insertions(+)
> >>>>
> >>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >>>> index 9c1a5877cf9f..fa64ee315241 100644
> >>>> --- a/drivers/md/raid5.c
> >>>> +++ b/drivers/md/raid5.c
> >>>> @@ -5710,6 +5710,11 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
> >>>>                 int d;
> >>>>         again:
> >>>>                 sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
> >>>> +               /* Bail out if REQ_NOWAIT is set */
> >>>> +               if (bi->bi_opf & REQ_NOWAIT) {
> >>>> +                       bio_wouldblock_error(bi);
> >>>> +                       return;
> >>>> +               }
> >>> This is not right. raid5_get_active_stripe() gets refcount on the sh,
> >>> we cannot simply return here. I think we need the logic after
> >>> raid5_release_stripe() and before schedule().
> >>>
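(To spell out the placement I had in mind there, going from memory of the
surrounding code in make_discard_request(): roughly something like the
untested sketch below, i.e. drop the stripe reference first and only then
fail the bio. I have not checked whether R5_Overlap needs extra cleanup on
this path, so please treat it as a sketch of the placement, not a tested
patch.)

 again:
        sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
        prepare_to_wait(&conf->wait_for_overlap, &w,
                        TASK_UNINTERRUPTIBLE);
        set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
        if (test_bit(STRIPE_SYNCING, &sh->state)) {
                raid5_release_stripe(sh);
                /* REQ_NOWAIT: the stripe reference has been dropped, so
                 * fail the bio here instead of sleeping.  finish_wait()
                 * is needed because prepare_to_wait() already ran above.
                 */
                if (bi->bi_opf & REQ_NOWAIT) {
                        finish_wait(&conf->wait_for_overlap, &w);
                        bio_wouldblock_error(bi);
                        return;
                }
                schedule();
                goto again;
        }

The other raid5_release_stripe()/schedule() pair further down in
make_discard_request() would need the same treatment.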
> >>>>                 prepare_to_wait(&conf->wait_for_overlap, &w,
> >>>>                                 TASK_UNINTERRUPTIBLE);
> >>>>                 set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
> >>>> @@ -5820,6 +5825,15 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> >>>>         bi->bi_next = NULL;
> >>>>
> >>>>         md_account_bio(mddev, &bi);
> >>>> +       /* Bail out if REQ_NOWAIT is set */
> >>>> +       if (bi->bi_opf & REQ_NOWAIT &&
> >>>> +           conf->reshape_progress != MaxSector &&
> >>>> +           mddev->reshape_backwards
> >>>> +           ? logical_sector < conf->reshape_safe
> >>>> +           : logical_sector >= conf->reshape_safe) {
> >>>> +               bio_wouldblock_error(bi);
> >>>> +               return true;
> >>>> +       }
> >>> This is also problematic, and is the trigger of those error messages.
> >>> We only want to trigger -EAGAIN when logical_sector is between
> >>> reshape_progress and reshape_safe.
> >>>
> >> Just to clarify, did you mean doing something like:
> >> if (bi->bi_opf & REQ_NOWAIT &&
> >> + conf->reshape_progress != MaxSector &&
> >> + (mddev->reshape_backwards
> >> + ? (logical_sector > conf->reshape_progress && logical_sector < conf->reshape_safe)
> >> + : logical_sector >= conf->reshape_safe)) {
> > I think this should be
> > : (logical_sector >= conf->reshape_safe && logical_sector <
> > conf->reshape_progress)
> >
> if (bi->bi_opf & REQ_NOWAIT &&
>     conf->reshape_progress != MaxSector &&
>     (mddev->reshape_backwards
>      ? (logical_sector > conf->reshape_progress &&
>         logical_sector <= conf->reshape_safe)
>      : (logical_sector >= conf->reshape_safe &&
>         logical_sector < conf->reshape_progress))) {
>         bio_wouldblock_error(bi);
>         return true;
> }
>
> After making this change along with the other changes, I ran some tests with
> 100% reads, 70% reads/30% writes, and 100% writes on a clean raid5 array.
>
> Unfortunately, I ran into the following task hang with 100% writes
> (with both libaio and io_uring):
>
> [21876.856692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [21876.864518] task:md5_raid5 state:D stack: 0 pid:11675
> ppid: 2 flags:0x00004000
> [21876.864522] Call Trace:
> [21876.864526] __schedule+0x2d4/0x970
> [21876.864532] ? wbt_cleanup_cb+0x20/0x20
> [21876.864535] schedule+0x4e/0xb0
> [21876.864537] io_schedule+0x3f/0x70
> [21876.864539] rq_qos_wait+0xb9/0x130
> [21876.864542] ? sysv68_partition+0x280/0x280
> [21876.864543] ? wbt_cleanup_cb+0x20/0x20
> [21876.864545] wbt_wait+0x92/0xc0
> [21876.864546] __rq_qos_throttle+0x25/0x40
> [21876.864548] blk_mq_submit_bio+0xc6/0x5d0
> [21876.864551] ? submit_bio_checks+0x39e/0x5f0
> [21876.864554] __submit_bio+0x1bc/0x1d0
> [21876.864555] ? kmem_cache_free+0x378/0x3c0
> [21876.864558] ? mempool_free_slab+0x17/0x20
> [21876.864562] submit_bio_noacct+0x256/0x2a0
> [21876.864565] 0xffffffffc01fa6d9
> [21876.864568] ? 0xffffffffc01f5d01
> [21876.864569] raid5_get_active_stripe+0x16c0/0x3e00 [raid456]
> [21876.864571] ? __wake_up_common_lock+0x8a/0xc0
> [21876.864575] raid5_get_active_stripe+0x2839/0x3e00 [raid456]
> [21876.864577] raid5_get_active_stripe+0x2d6e/0x3e00 [raid456]
> [21876.864579] md_thread+0xae/0x170
> [21876.864581] ? wait_woken+0x60/0x60
> [21876.864582] ? md_start_sync+0x60/0x60
> [21876.864584] kthread+0x127/0x150
> [21876.864586] ? set_kthread_struct+0x40/0x40
> [21876.864588] ret_from_fork+0x1f/0x30

While there is something suspicious with raid10_handle_discard(), I don't
think it is related to this error. I couldn't reproduce this in my tests.

Thanks,
Song