On 2023-04-20 14:07, Song Liu wrote:
> On Thu, Apr 20, 2023 at 8:54 AM Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
>>
>> On 2023-04-20 05:26, Jan Kara wrote:
>>> On Wed 19-04-23 22:26:07, Song Liu wrote:
>>>> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
>>>>> On 2023-04-17 11:15, Jan Kara wrote:
>>>>>> Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
>>>>>> order in which requests for the underlying disks are created. For large
>>>>>> sequential IO, request creation frequently races with the md_raid5
>>>>>> thread submitting bios to the underlying disks, so the intermediate
>>>>>> states of the new creation order change the IO pattern and produce
>>>>>> more, smaller, discontiguous requests. For RAID5 on top of three
>>>>>> rotational disks, our performance testing revealed this causes a
>>>>>> regression in write throughput:
>>>>>>
>>>>>> iozone -a -s 131072000 -y 4 -q 8 -i 0 -i 1 -R
>>>>>>
>>>>>> before 7e55c60acfbb:
>>>>>>            KB  reclen   write  rewrite    read   reread
>>>>>>     131072000       4  493670   525964  524575   513384
>>>>>>     131072000       8  540467   532880  512028   513703
>>>>>>
>>>>>> after 7e55c60acfbb:
>>>>>>            KB  reclen   write  rewrite    read   reread
>>>>>>     131072000       4  421785   456184  531278   509248
>>>>>>     131072000       8  459283   456354  528449   543834
>>>>>>
>>>>>> To reduce the number of discontiguous requests we can start generating
>>>>>> requests with the stripe that has the lowest chunk offset, as that has
>>>>>> the best chance of being adjacent to IO queued previously. This improves
>>>>>> the performance to:
>>>>>>
>>>>>>            KB  reclen   write  rewrite    read   reread
>>>>>>     131072000       4  497682   506317  518043   514559
>>>>>>     131072000       8  514048   501886  506453   504319
>>>>>>
>>>>>> restoring a big part of the regression.
>>>>>>
>>>>>> Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
>>>>>> Signed-off-by: Jan Kara <jack@xxxxxxx>
>>>>>
>>>>> Looks good to me. I ran it through some of the functional tests I used
>>>>> to develop the patch in question and found no issues.
>>>>>
>>>>> Reviewed-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>
>>>>
>>>> Thanks Jan and Logan! I will apply this to md-next. But it may not make
>>>> the 6.4 release, as we are already at rc7.
>>>
>>> OK, sure, this is not a critical issue.
>>>
>>>>>> ---
>>>>>>  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>>>>>  1 file changed, 44 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> I'm by no means a raid5 expert, but this is what I was able to come up
>>>>>> with. Any opinions or ideas on how to fix the problem in a better way?
>>>>>
>>>>> The other option would be to revert to the old method for spinning disks
>>>>> and use the pivot option only on SSDs. The pivot optimization really
>>>>> only applies at IO speeds that can be achieved by fast solid state disks
>>>>> anyway, and spinning disks are unlikely to sustain enough IOPS to create
>>>>> the lock contention the optimization was meant to relieve.
>>>>>
>>>>> So it could make sense to just fall back to the old method of preparing
>>>>> the stripes in logical block order if we are running on spinning disks.
>>>>> Though, that might be a bit more involved than what this patch does. So
>>>>> I think this is probably a good approach, unless we want to recover more
>>>>> of the lost throughput.
>>>>
>>>> How about we only do the optimization in this patch for spinning disks?
>>>> Something like:
>>>>
>>>>     if (likely(conf->reshape_progress == MaxSector) &&
>>>>         !blk_queue_nonrot(mddev->queue))
>>>>             logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
>>>
>>> Yeah, makes sense. On SSDs I was not able to observe any adverse
>>> effects of the different stripe order. Will you update the patch or
>>> should I respin it?
>>
>> Does it make sense? If SSDs work fine with the new stripe order, why do
>> things differently for them? So I don't see a benefit to restricting the
>> fix to rotational devices. It's my original change which could be made
>> non-rotational-only, but that's much more involved.
>
> I am hoping to make raid5_make_request() a little faster for non-rotational
> devices. We may not easily observe a difference in performance, but things
> add up. Does this make sense?

I guess. But without a performance test that shows an improvement, I'm
hesitant about that. It could be that it helps a tiny bit on non-rotational
disks, but we just don't know, and unfortunately I don't have the time right
now to run those performance tests.

Logan