On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > > Hi, > > 在 2024/02/27 10:23, Xiao Ni 写道: > > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > >> > >> From: Yu Kuai <yukuai3@xxxxxxxxxx> > >> > >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add > >> the case choose next idle in read_balance(): > >> > >> read_balance: > >> for_each_rdev > >> if(next_seq_sect == this_sector || disk == 0) > >> -> sequential reads > >> best_disk = disk; > >> if (...) > >> choose_next_idle = 1 > >> continue; > >> > >> for_each_rdev > >> -> iterate next rdev > >> if (pending == 0) > >> best_disk = disk; > >> -> choose the next idle disk > >> break; > >> > >> if (choose_next_idle) > >> -> keep using this rdev if there are no other idle disk > >> contine > >> > >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.") > >> remove the code: > >> > >> - /* If device is idle, use it */ > >> - if (pending == 0) { > >> - best_disk = disk; > >> - break; > >> - } > >> > >> Hence choose next idle will never work now, fix this problem by > >> following: > >> > >> 1) don't set best_disk in this case, read_balance() will choose the best > >> disk after iterating all the disks; > >> 2) add 'pending' so that other idle disk will be chosen; > >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks > >> are rotational, this disk will still be chosen; > >> > >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.") > >> Co-developed-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx> > >> Signed-off-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx> > >> Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx> > >> --- > >> drivers/md/raid1.c | 21 ++++++++++++--------- > >> 1 file changed, 12 insertions(+), 9 deletions(-) > >> > >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > >> index c60ea58ae8c5..d0bc67e6d068 100644 > >> --- a/drivers/md/raid1.c > >> +++ b/drivers/md/raid1.c > >> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >> unsigned int min_pending; > >> struct md_rdev *rdev; > >> int choose_first; > >> - int choose_next_idle; > >> > >> /* > >> * Check if we can balance. We can balance on the whole > >> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >> best_pending_disk = -1; > >> min_pending = UINT_MAX; > >> best_good_sectors = 0; > >> - choose_next_idle = 0; > >> clear_bit(R1BIO_FailFast, &r1_bio->state); > >> > >> if ((conf->mddev->recovery_cp < this_sector + sectors) || > >> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9; > >> struct raid1_info *mirror = &conf->mirrors[disk]; > >> > >> - best_disk = disk; > >> /* > >> * If buffered sequential IO size exceeds optimal > >> * iosize, check if there is idle disk. If yes, choose > >> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >> mirror->next_seq_sect > opt_iosize && > >> mirror->next_seq_sect - opt_iosize >= > >> mirror->seq_start) { > >> - choose_next_idle = 1; > >> - continue; > >> + /* > >> + * Add 'pending' to avoid choosing this disk if > >> + * there is other idle disk. > >> + * Set 'dist' to 0, so that if there is no other > >> + * idle disk and all disks are rotational, this > >> + * disk will still be chosen. > >> + */ > >> + pending++; > >> + dist = 0; > >> + } else { > >> + best_disk = disk; > >> + break; > >> } > >> - break; > >> } > > > > Hi Kuai > > > > I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's > > a sequential read. If there are no other idle disks, it will read from > > the sequential disk. With this patch, it reads from the > > best_pending_disk even min_pending is not 0. It looks like a wrong > > behaviour? > > Yes, nice catch, I didn't notice this yet... So there is a hidden > logical, sequential IO priority is higher than minimal 'pending' > selection, it's only less than 'choose_next_idle' where idle disk > exist. Yes. > > Looks like if we want to keep this behaviour, we can add a 'sequential > disk': > > if (is_sequential()) > if (!should_choose_next()) > return disk; > ctl.sequential_disk = disk; > > ... > > if (ctl.min_pending != 0 && ctl.sequential_disk != -1) > return ctl.sequential_disk; Agree with this, thanks :) Best Regards Xiao > > Thanks, > Kuai > > > > > Best Regards > > Xiao > >> > >> - if (choose_next_idle) > >> - continue; > >> - > >> if (min_pending > pending) { > >> min_pending = pending; > >> best_pending_disk = disk; > >> -- > >> 2.39.2 > >> > >> > > > > . > > >