Re: [RFC PATCH v4 4/4] md: raid456 add nowait support

On 12/13/21 3:43 PM, Vishal Verma wrote:
> 
> On 12/12/21 10:56 PM, Song Liu wrote:
>> On Fri, Dec 10, 2021 at 10:26 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
>>>
>>> On 12/9/21 7:16 PM, Song Liu wrote:
>>>> On Wed, Nov 10, 2021 at 10:15 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
>>>>> Returns -EAGAIN when the raid456 driver would block
>>>>> waiting for situations like:
>>>>>
>>>>>     - Reshape operation,
>>>>>     - Discard operation.
>>>>>
>>>>> Signed-off-by: Vishal Verma <vverma@xxxxxxxxxxxxxxxx>
>>>>> ---
>>>>>    drivers/md/raid5.c | 14 ++++++++++++++
>>>>>    1 file changed, 14 insertions(+)
>>>>>
>>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>>> index 9c1a5877cf9f..fa64ee315241 100644
>>>>> --- a/drivers/md/raid5.c
>>>>> +++ b/drivers/md/raid5.c
>>>>> @@ -5710,6 +5710,11 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>>>>>                   int d;
>>>>>           again:
>>>>>                   sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
>>>>> +               /* Bail out if REQ_NOWAIT is set */
>>>>> +               if (bi->bi_opf & REQ_NOWAIT) {
>>>>> +                       bio_wouldblock_error(bi);
>>>>> +                       return;
>>>>> +               }
>>>> This is not right. raid5_get_active_stripe() gets a refcount on the sh,
>>>> so we cannot simply return here. I think we need the logic after
>>>> raid5_release_stripe() and before schedule().
>>>>
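>>>> Roughly something like this (untested sketch; the surrounding lines
>>>> are quoted from memory of make_discard_request() and may not match
>>>> your tree exactly):
>>>>
>>>>                 if (test_bit(STRIPE_SYNCING, &sh->state)) {
>>>>                         raid5_release_stripe(sh);
>>>>                         /* stripe refcount dropped; safe to bail out */
>>>>                         if (bi->bi_opf & REQ_NOWAIT) {
>>>>                                 /* leave waitqueue, restore task state */
>>>>                                 finish_wait(&conf->wait_for_overlap, &w);
>>>>                                 /* caveat: R5_Overlap set above is not
>>>>                                  * cleared on this path */
>>>>                                 bio_wouldblock_error(bi);
>>>>                                 return;
>>>>                         }
>>>>                         schedule();
>>>>                         goto again;
>>>>                 }
>>>>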
>>>>>                   prepare_to_wait(&conf->wait_for_overlap, &w,
>>>>>                                   TASK_UNINTERRUPTIBLE);
>>>>>                   set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
>>>>> @@ -5820,6 +5825,15 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>>>>>           bi->bi_next = NULL;
>>>>>
>>>>>           md_account_bio(mddev, &bi);
>>>>> +       /* Bail out if REQ_NOWAIT is set */
>>>>> +       if (bi->bi_opf & REQ_NOWAIT &&
>>>>> +           conf->reshape_progress != MaxSector &&
>>>>> +           mddev->reshape_backwards
>>>>> +           ? logical_sector < conf->reshape_safe
>>>>> +           : logical_sector >= conf->reshape_safe) {
>>>>> +               bio_wouldblock_error(bi);
>>>>> +               return true;
>>>>> +       }
>>>> This is also problematic, and is what triggers those error messages.
>>>> We only want to trigger -EAGAIN when logical_sector is between
>>>> reshape_progress and reshape_safe.
>>>>
>>> Just to clarify, did you mean doing something like:
>>>
>>> if (bi->bi_opf & REQ_NOWAIT &&
>>>     conf->reshape_progress != MaxSector &&
>>>     (mddev->reshape_backwards
>>>      ? (logical_sector > conf->reshape_progress && logical_sector < conf->reshape_safe)
>>>      : logical_sector >= conf->reshape_safe)) {
>> I think this should be
>>
>>     : (logical_sector >= conf->reshape_safe &&
>>        logical_sector < conf->reshape_progress)
> 
> 
> if (bi->bi_opf & REQ_NOWAIT &&
>     conf->reshape_progress != MaxSector &&
>     (mddev->reshape_backwards
>      ? (logical_sector > conf->reshape_progress &&
>         logical_sector <= conf->reshape_safe)
>      : (logical_sector >= conf->reshape_safe &&
>         logical_sector < conf->reshape_progress))) {
>         bio_wouldblock_error(bi);
>         return true;
> }
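> 
> For readability, the check is equivalent to a small helper along these
> lines (illustrative only; the helper is hypothetical and not part of
> the patch):
> 
> /*
>  * Hypothetical helper: true when logical_sector falls in the region
>  * an in-flight reshape has not yet made safe, i.e. where
>  * raid5_make_request() would otherwise block waiting for the reshape
>  * to advance.
>  */
> static bool in_unsafe_reshape_window(struct mddev *mddev,
>                                      struct r5conf *conf,
>                                      sector_t logical_sector)
> {
>         if (conf->reshape_progress == MaxSector)
>                 return false;   /* no reshape in progress */
>         if (mddev->reshape_backwards)
>                 return logical_sector > conf->reshape_progress &&
>                        logical_sector <= conf->reshape_safe;
>         return logical_sector >= conf->reshape_safe &&
>                logical_sector < conf->reshape_progress;
> }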
> 
> After making this change, along with the other changes, I ran some tests
> with 100% reads, 70% reads / 30% writes, and 100% writes on a clean raid5
> array.
> 
> Unfortunately, I ran into the following task hang with 100% writes
> (with both libaio and io_uring):
> 
> [21876.856692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [21876.864518] task:md5_raid5       state:D stack:    0 pid:11675 ppid:     2 flags:0x00004000
> [21876.864522] Call Trace:
> [21876.864526]  __schedule+0x2d4/0x970
> [21876.864532]  ? wbt_cleanup_cb+0x20/0x20
> [21876.864535]  schedule+0x4e/0xb0
> [21876.864537]  io_schedule+0x3f/0x70
> [21876.864539]  rq_qos_wait+0xb9/0x130
> [21876.864542]  ? sysv68_partition+0x280/0x280
> [21876.864543]  ? wbt_cleanup_cb+0x20/0x20
> [21876.864545]  wbt_wait+0x92/0xc0
> [21876.864546]  __rq_qos_throttle+0x25/0x40
> [21876.864548]  blk_mq_submit_bio+0xc6/0x5d0
> [21876.864551]  ? submit_bio_checks+0x39e/0x5f0
> [21876.864554]  __submit_bio+0x1bc/0x1d0
> [21876.864555]  ? kmem_cache_free+0x378/0x3c0
> [21876.864558]  ? mempool_free_slab+0x17/0x20
> [21876.864562]  submit_bio_noacct+0x256/0x2a0
> [21876.864565]  0xffffffffc01fa6d9
> [21876.864568]  ? 0xffffffffc01f5d01
> [21876.864569]  raid5_get_active_stripe+0x16c0/0x3e00 [raid456]
> [21876.864571]  ? __wake_up_common_lock+0x8a/0xc0
> [21876.864575]  raid5_get_active_stripe+0x2839/0x3e00 [raid456]
> [21876.864577]  raid5_get_active_stripe+0x2d6e/0x3e00 [raid456]
> [21876.864579]  md_thread+0xae/0x170
> [21876.864581]  ? wait_woken+0x60/0x60
> [21876.864582]  ? md_start_sync+0x60/0x60
> [21876.864584]  kthread+0x127/0x150
> [21876.864586]  ? set_kthread_struct+0x40/0x40
> [21876.864588]  ret_from_fork+0x1f/0x30

What kernel base are you using for your patches?

-- 
Jens Axboe
