On Mon, Dec 13, 2021 at 2:43 PM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
>
>
> On 12/12/21 10:56 PM, Song Liu wrote:
> > On Fri, Dec 10, 2021 at 10:26 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
> >>
> >> On 12/9/21 7:16 PM, Song Liu wrote:
> >>> On Wed, Nov 10, 2021 at 10:15 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
> >>>> Returns EAGAIN in case the raid456 driver would block
> >>>> waiting for situations like:
> >>>>
> >>>> - Reshape operation,
> >>>> - Discard operation.
> >>>>
> >>>> Signed-off-by: Vishal Verma <vverma@xxxxxxxxxxxxxxxx>
> >>>> ---
> >>>>  drivers/md/raid5.c | 14 ++++++++++++++
> >>>>  1 file changed, 14 insertions(+)
> >>>>
> >>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >>>> index 9c1a5877cf9f..fa64ee315241 100644
> >>>> --- a/drivers/md/raid5.c
> >>>> +++ b/drivers/md/raid5.c
> >>>> @@ -5710,6 +5710,11 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
> >>>>                 int d;
> >>>>         again:
> >>>>                 sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
> >>>> +               /* Bail out if REQ_NOWAIT is set */
> >>>> +               if (bi->bi_opf & REQ_NOWAIT) {
> >>>> +                       bio_wouldblock_error(bi);
> >>>> +                       return;
> >>>> +               }
> >>> This is not right. raid5_get_active_stripe() gets refcount on the sh,
> >>> we cannot simply return here. I think we need the logic after
> >>> raid5_release_stripe() and before schedule().
> >>>
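(To spell out the placement I had in mind there, going from memory of the
surrounding code in make_discard_request(): roughly something like the
untested sketch below, i.e. drop the stripe reference first and only then
fail the bio. I have not checked whether R5_Overlap needs extra cleanup on
this path, so please treat it as a sketch of the placement, not a tested
patch.)

 again:
        sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
        prepare_to_wait(&conf->wait_for_overlap, &w,
                        TASK_UNINTERRUPTIBLE);
        set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
        if (test_bit(STRIPE_SYNCING, &sh->state)) {
                raid5_release_stripe(sh);
                /* REQ_NOWAIT: the stripe reference has been dropped, so
                 * fail the bio here instead of sleeping.  finish_wait()
                 * is needed because prepare_to_wait() already ran above.
                 */
                if (bi->bi_opf & REQ_NOWAIT) {
                        finish_wait(&conf->wait_for_overlap, &w);
                        bio_wouldblock_error(bi);
                        return;
                }
                schedule();
                goto again;
        }

The other raid5_release_stripe()/schedule() pair further down in
make_discard_request() would need the same treatment.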
> >>>>                 prepare_to_wait(&conf->wait_for_overlap, &w,
> >>>>                                 TASK_UNINTERRUPTIBLE);
> >>>>                 set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
> >>>> @@ -5820,6 +5825,15 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> >>>>         bi->bi_next = NULL;
> >>>>
> >>>>         md_account_bio(mddev, &bi);
> >>>> +       /* Bail out if REQ_NOWAIT is set */
> >>>> +       if (bi->bi_opf & REQ_NOWAIT &&
> >>>> +           conf->reshape_progress != MaxSector &&
> >>>> +           mddev->reshape_backwards
> >>>> +           ? logical_sector < conf->reshape_safe
> >>>> +           : logical_sector >= conf->reshape_safe) {
> >>>> +               bio_wouldblock_error(bi);
> >>>> +               return true;
> >>>> +       }
> >>> This is also problematic, and is the trigger of those error messages.
> >>> We only want to trigger -EAGAIN when logical_sector is between
> >>> reshape_progress and reshape_safe.
> >>>
> >> Just to clarify, did you mean doing something like:
> >> if (bi->bi_opf & REQ_NOWAIT &&
> >> + conf->reshape_progress != MaxSector &&
> >> + (mddev->reshape_backwards
> >> + ? (logical_sector > conf->reshape_progress && logical_sector < conf->reshape_safe)
> >> + : logical_sector >= conf->reshape_safe)) {
> > I think this should be
> > : (logical_sector >= conf->reshape_safe && logical_sector <
> > conf->reshape_progress)
> >
> if (bi->bi_opf & REQ_NOWAIT &&
>     conf->reshape_progress != MaxSector &&
>     (mddev->reshape_backwards
>      ? (logical_sector > conf->reshape_progress &&
>         logical_sector <= conf->reshape_safe)
>      : (logical_sector >= conf->reshape_safe &&
>         logical_sector < conf->reshape_progress))) {
>         bio_wouldblock_error(bi);
>         return true;
> }
>
> After making this change along with the other changes, I ran some tests with
> 100% reads, 70% reads/30% writes, and 100% writes on a clean raid5 array.
>
> Unfortunately, I ran into the following task hang with 100% writes
> (with both libaio and io_uring):
>
> [21876.856692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [21876.864518] task:md5_raid5 state:D stack: 0 pid:11675
> ppid: 2 flags:0x00004000
> [21876.864522] Call Trace:
> [21876.864526] __schedule+0x2d4/0x970
> [21876.864532] ? wbt_cleanup_cb+0x20/0x20
> [21876.864535] schedule+0x4e/0xb0
> [21876.864537] io_schedule+0x3f/0x70
> [21876.864539] rq_qos_wait+0xb9/0x130
> [21876.864542] ? sysv68_partition+0x280/0x280
> [21876.864543] ? wbt_cleanup_cb+0x20/0x20
> [21876.864545] wbt_wait+0x92/0xc0
> [21876.864546] __rq_qos_throttle+0x25/0x40
> [21876.864548] blk_mq_submit_bio+0xc6/0x5d0
> [21876.864551] ? submit_bio_checks+0x39e/0x5f0
> [21876.864554] __submit_bio+0x1bc/0x1d0
> [21876.864555] ? kmem_cache_free+0x378/0x3c0
> [21876.864558] ? mempool_free_slab+0x17/0x20
> [21876.864562] submit_bio_noacct+0x256/0x2a0
> [21876.864565] 0xffffffffc01fa6d9
> [21876.864568] ? 0xffffffffc01f5d01
> [21876.864569] raid5_get_active_stripe+0x16c0/0x3e00 [raid456]
> [21876.864571] ? __wake_up_common_lock+0x8a/0xc0
> [21876.864575] raid5_get_active_stripe+0x2839/0x3e00 [raid456]
> [21876.864577] raid5_get_active_stripe+0x2d6e/0x3e00 [raid456]
> [21876.864579] md_thread+0xae/0x170
> [21876.864581] ? wait_woken+0x60/0x60
> [21876.864582] ? md_start_sync+0x60/0x60
> [21876.864584] kthread+0x127/0x150
> [21876.864586] ? set_kthread_struct+0x40/0x40
> [21876.864588] ret_from_fork+0x1f/0x30

While there is something suspicious with raid10_handle_discard(), I don't
think it is related to this error. I couldn't reproduce this in my tests.

Thanks,
Song