On 12/13/21 3:43 PM, Vishal Verma wrote: > > On 12/12/21 10:56 PM, Song Liu wrote: >> On Fri, Dec 10, 2021 at 10:26 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote: >>> >>> On 12/9/21 7:16 PM, Song Liu wrote: >>>> On Wed, Nov 10, 2021 at 10:15 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote: >>>>> Returns EAGAIN in case the raid456 driver would block >>>>> waiting for situations like: >>>>> >>>>> - Reshape operation, >>>>> - Discard operation. >>>>> >>>>> Signed-off-by: Vishal Verma <vverma@xxxxxxxxxxxxxxxx> >>>>> --- >>>>> drivers/md/raid5.c | 14 ++++++++++++++ >>>>> 1 file changed, 14 insertions(+) >>>>> >>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c >>>>> index 9c1a5877cf9f..fa64ee315241 100644 >>>>> --- a/drivers/md/raid5.c >>>>> +++ b/drivers/md/raid5.c >>>>> @@ -5710,6 +5710,11 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi) >>>>> int d; >>>>> again: >>>>> sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0); >>>>> + /* Bail out if REQ_NOWAIT is set */ >>>>> + if (bi->bi_opf & REQ_NOWAIT) { >>>>> + bio_wouldblock_error(bi); >>>>> + return; >>>>> + } >>>> This is not right. raid5_get_active_stripe() gets refcount on the sh, >>>> we cannot simply >>>> return here. I think we need the logic after raid5_release_stripe() >>>> and before schedule(). >>>> >>>>> prepare_to_wait(&conf->wait_for_overlap, &w, >>>>> TASK_UNINTERRUPTIBLE); >>>>> set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags); >>>>> @@ -5820,6 +5825,15 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) >>>>> bi->bi_next = NULL; >>>>> >>>>> md_account_bio(mddev, &bi); >>>>> + /* Bail out if REQ_NOWAIT is set */ >>>>> + if (bi->bi_opf & REQ_NOWAIT && >>>>> + conf->reshape_progress != MaxSector && >>>>> + mddev->reshape_backwards >>>>> + ? logical_sector < conf->reshape_safe >>>>> + : logical_sector >= conf->reshape_safe) { >>>>> + bio_wouldblock_error(bi); >>>>> + return true; >>>>> + } >>>> This is also problematic, and is the trigger of those error messages. >>>> We only want to trigger -EAGAIN when logical_sector is between >>>> reshape_progress and reshape_safe. >>>> >>>> Just to clarify, did you mean doing something like: >>>> if (bi->bi_opf & REQ_NOWAIT && >>>> + conf->reshape_progress != MaxSector && >>>> + (mddev->reshape_backwards >>>> + ? (logical_sector > conf->reshape_progress && logical_sector < conf->reshape_safe) >>>> + : logical_sector >= conf->reshape_safe)) { >> I think this should be >> : (logical_sector >= conf->reshape_safe && logical_sector < >> conf->reshape_progress) > > > if (bi->bi_opf & REQ_NOWAIT && > conf->reshape_progress != MaxSector && > (mddev->reshape_backwards > ? (logical_sector > conf->reshape_progress && > logical_sector <= conf->reshape_safe) > : (logical_sector >= conf->reshape_safe && > logical_sector < conf->reshape_progress))) { > bio_wouldblock_error(bi); > return true; > } > > After making this change along with other changes, I ran some tests with > 100% reads, 70%read30%writes and 100% writes on a clean raid5 array. > > Unfortunately, I ran into this following task hung with 100% writes > (with both libaio and io_uring): > > [21876.856692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [21876.864518] task:md5_raid5 state:D stack: 0 pid:11675 > ppid: 2 flags:0x00004000 > [21876.864522] Call Trace: > [21876.864526] __schedule+0x2d4/0x970 > [21876.864532] ? wbt_cleanup_cb+0x20/0x20 > [21876.864535] schedule+0x4e/0xb0 > [21876.864537] io_schedule+0x3f/0x70 > [21876.864539] rq_qos_wait+0xb9/0x130 > [21876.864542] ? sysv68_partition+0x280/0x280 > [21876.864543] ? wbt_cleanup_cb+0x20/0x20 > [21876.864545] wbt_wait+0x92/0xc0 > [21876.864546] __rq_qos_throttle+0x25/0x40 > [21876.864548] blk_mq_submit_bio+0xc6/0x5d0 > [21876.864551] ? submit_bio_checks+0x39e/0x5f0 > [21876.864554] __submit_bio+0x1bc/0x1d0 > [21876.864555] ? kmem_cache_free+0x378/0x3c0 > [21876.864558] ? mempool_free_slab+0x17/0x20 > [21876.864562] submit_bio_noacct+0x256/0x2a0 > [21876.864565] 0xffffffffc01fa6d9 > [21876.864568] ? 0xffffffffc01f5d01 > [21876.864569] raid5_get_active_stripe+0x16c0/0x3e00 [raid456] > [21876.864571] ? __wake_up_common_lock+0x8a/0xc0 > [21876.864575] raid5_get_active_stripe+0x2839/0x3e00 [raid456] > [21876.864577] raid5_get_active_stripe+0x2d6e/0x3e00 [raid456] > [21876.864579] md_thread+0xae/0x170 > [21876.864581] ? wait_woken+0x60/0x60 > [21876.864582] ? md_start_sync+0x60/0x60 > [21876.864584] kthread+0x127/0x150 > [21876.864586] ? set_kthread_struct+0x40/0x40 > [21876.864588] ret_from_fork+0x1f/0x30 What kernel base are you using for your patches? -- Jens Axboe