On 2023-04-17 11:15, Jan Kara wrote: > Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the > order in which requests for underlying disks are created. Since for > large sequential IO adding of requests frequently races with md_raid5 > thread submitting bios to underlying disks, this results in a change in > IO pattern because intermediate states of new order of request creation > result in more smaller discontiguous requests. For RAID5 on top of three > rotational disks our performance testing revealed this results in > regression in write throughput: > > iozone -a -s 131072000 -y 4 -q 8 -i 0 -i 1 -R > > before 7e55c60acfbb: > KB reclen write rewrite read reread > 131072000 4 493670 525964 524575 513384 > 131072000 8 540467 532880 512028 513703 > > after 7e55c60acfbb: > KB reclen write rewrite read reread > 131072000 4 421785 456184 531278 509248 > 131072000 8 459283 456354 528449 543834 > > To reduce the amount of discontiguous requests we can start generating > requests with the stripe with the lowest chunk offset as that has the > best chance of being adjacent to IO queued previously. This improves the > performance to: > KB reclen write rewrite read reread > 131072000 4 497682 506317 518043 514559 > 131072000 8 514048 501886 506453 504319 > > restoring big part of the regression. > > Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") > Signed-off-by: Jan Kara <jack@xxxxxxx> Looks good to me. I ran it through some of the functional tests I used to develop the patch in question and found no issues. Reviewed-by: Logan Gunthorpe <logang@xxxxxxxxxxxx> > --- > drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 44 insertions(+), 1 deletion(-) > > I'm by no means raid5 expert but this is what I was able to come up with. Any > opinion or ideas how to fix the problem in a better way? The other option would be to revert to the old method for spinning disks and use the pivot option only on SSDs. The pivot optimization really only applies at IO speeds that can be achieved by fast solid state disks anyway, and there isn't likely enough IOPS possible on spinning disks to get enough lock contention that the optimization would provide any benefit. So it could make sense to just fall back to the old method of preparing the stripes in logical block order if we are running on spinning disks. Though, that might be a bit more involved than what this patch does. So I think this is probably a good approach, unless we want to recover more of the lost throughput. Logan