On Mon, Feb 26, 2024 at 5:40 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > > Hi, > > 在 2024/02/26 17:24, Xiao Ni 写道: > > On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > >> > >> Hi, > >> > >> 在 2024/02/26 16:55, Xiao Ni 写道: > >>> Hi Kuai > >>> > >>> Thanks for the effort! > >>> > >>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > >>>> > >>>> From: Yu Kuai <yukuai3@xxxxxxxxxx> > >>>> > >>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add > >>>> the case choose next idle in read_balance(): > >>>> > >>>> read_balance: > >>>> for_each_rdev > >>>> if(next_seq_sect == this_sector || disk == 0) > >>> > >>> typo error: s/disk/dist/g > >>> > >>>> -> sequential reads > >>>> best_disk = disk; > >>>> if (...) > >>>> choose_next_idle = 1 > >>>> continue; > >>>> > >>>> for_each_rdev > >>>> -> iterate next rdev > >>>> if (pending == 0) > >>>> best_disk = disk; > >>>> -> choose the next idle disk > >>>> break; > >>>> > >>>> if (choose_next_idle) > >>>> -> keep using this rdev if there are no other idle disk > >>>> continue > >>>> > >>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.") > >>>> remove the code: > >>>> > >>>> - /* If device is idle, use it */ > >>>> - if (pending == 0) { > >>>> - best_disk = disk; > >>>> - break; > >>>> - } > >>>> > >>>> Hence choose next idle will never work now, fix this problem by > >>>> following: > >>>> > >>>> 1) don't set best_disk in this case, read_balance() will choose the best > >>>> disk after iterating all the disks; > >>>> 2) add 'pending' so that other idle disk will be chosen; > >>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks > >>>> are rotational, this disk will still be chosen; > >>>> > >>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.") > >>>> Co-developed-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx> > >>>> Signed-off-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx> > >>>> Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx> > >>>> --- > >>>> drivers/md/raid1.c | 21 ++++++++++++--------- > >>>> 1 file changed, 12 insertions(+), 9 deletions(-) > >>>> > >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > >>>> index c60ea58ae8c5..d0bc67e6d068 100644 > >>>> --- a/drivers/md/raid1.c > >>>> +++ b/drivers/md/raid1.c > >>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >>>> unsigned int min_pending; > >>>> struct md_rdev *rdev; > >>>> int choose_first; > >>>> - int choose_next_idle; > >>>> > >>>> /* > >>>> * Check if we can balance. We can balance on the whole > >>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >>>> best_pending_disk = -1; > >>>> min_pending = UINT_MAX; > >>>> best_good_sectors = 0; > >>>> - choose_next_idle = 0; > >>>> clear_bit(R1BIO_FailFast, &r1_bio->state); > >>>> > >>>> if ((conf->mddev->recovery_cp < this_sector + sectors) || > >>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >>>> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9; > >>>> struct raid1_info *mirror = &conf->mirrors[disk]; > >>>> > >>>> - best_disk = disk; > >>>> /* > >>>> * If buffered sequential IO size exceeds optimal > >>>> * iosize, check if there is idle disk. If yes, choose > >>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect > >>>> mirror->next_seq_sect > opt_iosize && > >>>> mirror->next_seq_sect - opt_iosize >= > >>>> mirror->seq_start) { > >>>> - choose_next_idle = 1; > >>>> - continue; > >>>> + /* > >>>> + * Add 'pending' to avoid choosing this disk if > >>>> + * there is other idle disk. > >>>> + * Set 'dist' to 0, so that if there is no other > >>>> + * idle disk and all disks are rotational, this > >>>> + * disk will still be chosen. > >>>> + */ > >>>> + pending++; > >>>> + dist = 0; > >>> > >>> There is a problem. If all disks are not idle and there is a disk with > >>> dist=0 before the seq disk, it can't read from the seq disk. It will > >>> read from the first disk with dist=0. Maybe we can only add the codes > >>> which are removed from 2e52d449bcec? > >> > >> If there is a disk with disk=0, then best_dist_disk will be updated to > >> the disk, and best_dist will be updated to 0 already: > >> > >> // in each iteration > >> if (dist < best_dist) { > >> best_dist = dist; > >> btest_disk_disk = disk; > >> } > >> > >> In this case, best_dist will be set to the first disk with dist=0, and > >> at last, the disk will be chosen: > >> > >> if (best_disk == -1) { > >> if (has_nonrot_disk || min_pending == 0) > >> best_disk = best_pending_disk; > >> else > >> best_disk = best_dist_disk; > >> -> the first disk with dist=0; > >> } > >> > >> So, the problem that you concerned should not exist. > > > > Hi Kuai > > > > Thanks for the explanation. You're right. It chooses the first disk > > which has dist==0. In the above, you made the same typo error disk=0 > > as the comment. I guess you want to use dist=0, right? Beside this, > > this patch is good to me. > > Yes, and Paul change the name 'best_dist' to 'closest_dist_disk', > and 'btest_disk_disk' to 'closest_dist' in the last patch to avoid typo > like this. :) Ah, thanks :) I haven't been there. Regards Xiao > > Thanks, > Kuai > > > > > > Best Regards > > Xiao > >> > >> Thanks, > >> Kuai > >>> > >>> Best Regards > >>> Xiao > >>> > >>>> + } else { > >>>> + best_disk = disk; > >>>> + break; > >>>> } > >>>> - break; > >>>> } > >>>> > >>>> - if (choose_next_idle) > >>>> - continue; > >>>> - > >>>> if (min_pending > pending) { > >>>> min_pending = pending; > >>>> best_pending_disk = disk; > >>>> -- > >>>> 2.39.2 > >>>> > >>>> > >>> > >>> . > >>> > >> > > > > . > > >