> On Feb 26, 2024, at 9:49 PM, Xiao Ni <xni@xxxxxxxxxx> wrote: > > On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: >> >> Hi, >> >> 在 2024/02/27 10:23, Xiao Ni 写道: >>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: >>>> >>>> From: Yu Kuai <yukuai3@xxxxxxxxxx> >>>> >>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add >>>> the case choose next idle in read_balance(): >>>> >>>> read_balance: >>>> for_each_rdev >>>> if(next_seq_sect == this_sector || disk == 0) >>>> -> sequential reads >>>> best_disk = disk; >>>> if (...) >>>> choose_next_idle = 1 >>>> continue; >>>> >>>> for_each_rdev >>>> -> iterate next rdev >>>> if (pending == 0) >>>> best_disk = disk; >>>> -> choose the next idle disk >>>> break; >>>> >>>> if (choose_next_idle) >>>> -> keep using this rdev if there are no other idle disk >>>> contine >>>> >>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.") >>>> remove the code: >>>> >>>> - /* If device is idle, use it */ >>>> - if (pending == 0) { >>>> - best_disk = disk; >>>> - break; >>>> - } >>>> >>>> Hence choose next idle will never work now, fix this problem by >>>> following: >>>> >>>> 1) don't set best_disk in this case, read_balance() will choose the best >>>> disk after iterating all the disks; >>>> 2) add 'pending' so that other idle disk will be chosen; >>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks >>>> are rotational, this disk will still be chosen; >>>> >>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.") >>>> Co-developed-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx> >>>> Signed-off-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx> >>>> Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx> >>>> --- >>>> drivers/md/raid1.c | 21 ++++++++++++--------- >>>> 1 file changed, 12 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>> index c60ea58ae8c5..d0bc67e6d068 100644 >>>> --- a/drivers/md/raid1.c >>>> +++ b/drivers/md/raid1.c >>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect >>>> unsigned int min_pending; >>>> struct md_rdev *rdev; >>>> int choose_first; >>>> - int choose_next_idle; >>>> >>>> /* >>>> * Check if we can balance. We can balance on the whole >>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect >>>> best_pending_disk = -1; >>>> min_pending = UINT_MAX; >>>> best_good_sectors = 0; >>>> - choose_next_idle = 0; >>>> clear_bit(R1BIO_FailFast, &r1_bio->state); >>>> >>>> if ((conf->mddev->recovery_cp < this_sector + sectors) || >>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect >>>> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9; >>>> struct raid1_info *mirror = &conf->mirrors[disk]; >>>> >>>> - best_disk = disk; >>>> /* >>>> * If buffered sequential IO size exceeds optimal >>>> * iosize, check if there is idle disk. If yes, choose >>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect >>>> mirror->next_seq_sect > opt_iosize && >>>> mirror->next_seq_sect - opt_iosize >= >>>> mirror->seq_start) { >>>> - choose_next_idle = 1; >>>> - continue; >>>> + /* >>>> + * Add 'pending' to avoid choosing this disk if >>>> + * there is other idle disk. >>>> + * Set 'dist' to 0, so that if there is no other >>>> + * idle disk and all disks are rotational, this >>>> + * disk will still be chosen. >>>> + */ >>>> + pending++; >>>> + dist = 0; >>>> + } else { >>>> + best_disk = disk; >>>> + break; >>>> } >>>> - break; >>>> } >>> >>> Hi Kuai >>> >>> I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's >>> a sequential read. If there are no other idle disks, it will read from >>> the sequential disk. With this patch, it reads from the >>> best_pending_disk even min_pending is not 0. It looks like a wrong >>> behaviour? >> >> Yes, nice catch, I didn't notice this yet... So there is a hidden >> logical, sequential IO priority is higher than minimal 'pending' >> selection, it's only less than 'choose_next_idle' where idle disk >> exist. > > Yes. > > >> >> Looks like if we want to keep this behaviour, we can add a 'sequential >> disk': >> >> if (is_sequential()) >> if (!should_choose_next()) >> return disk; >> ctl.sequential_disk = disk; >> >> ... >> >> if (ctl.min_pending != 0 && ctl.sequential_disk != -1) >> return ctl.sequential_disk; > > Agree with this, thanks :) > > Best Regards > Xiao Yup, agree as well. This will help for sure with the followup to this series for seq read improvements :) >> >> Thanks, >> Kuai >> >>> >>> Best Regards >>> Xiao >>>> >>>> - if (choose_next_idle) >>>> - continue; >>>> - >>>> if (min_pending > pending) { >>>> min_pending = pending; >>>> best_pending_disk = disk; >>>> -- >>>> 2.39.2 >>>> >>>> >>> >>> . >>> >> > >