On Thu, Apr 20, 2023 at 8:54 AM Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
>
> On 2023-04-20 05:26, Jan Kara wrote:
> > On Wed 19-04-23 22:26:07, Song Liu wrote:
> >> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
> >>> On 2023-04-17 11:15, Jan Kara wrote:
> >>>> Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
> >>>> order in which requests for underlying disks are created. Since for
> >>>> large sequential IO the adding of requests frequently races with the
> >>>> md_raid5 thread submitting bios to underlying disks, this results in a
> >>>> change in IO pattern, because intermediate states of the new order of
> >>>> request creation result in more, smaller discontiguous requests. For
> >>>> RAID5 on top of three rotational disks our performance testing revealed
> >>>> this results in a regression in write throughput:
> >>>>
> >>>>   iozone -a -s 131072000 -y 4 -q 8 -i 0 -i 1 -R
> >>>>
> >>>> before 7e55c60acfbb:
> >>>>          KB  reclen   write  rewrite     read    reread
> >>>>   131072000       4  493670   525964   524575   513384
> >>>>   131072000       8  540467   532880   512028   513703
> >>>>
> >>>> after 7e55c60acfbb:
> >>>>          KB  reclen   write  rewrite     read    reread
> >>>>   131072000       4  421785   456184   531278   509248
> >>>>   131072000       8  459283   456354   528449   543834
> >>>>
> >>>> To reduce the number of discontiguous requests we can start generating
> >>>> requests with the stripe that has the lowest chunk offset, as that has
> >>>> the best chance of being adjacent to IO queued previously. This improves
> >>>> the performance to:
> >>>>
> >>>>          KB  reclen   write  rewrite     read    reread
> >>>>   131072000       4  497682   506317   518043   514559
> >>>>   131072000       8  514048   501886   506453   504319
> >>>>
> >>>> restoring a big part of the regression.
> >>>>
> >>>> Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
> >>>> Signed-off-by: Jan Kara <jack@xxxxxxx>
> >>>
> >>> Looks good to me. I ran it through some of the functional tests I used
> >>> to develop the patch in question and found no issues.
> >>>
> >>> Reviewed-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>
> >>
> >> Thanks Jan and Logan! I will apply this to md-next. But it may not make
> >> the 6.4 release, as we are already at rc7.
> >
> > OK, sure, this is not a critical issue.
> >
> >>>> ---
> >>>>  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> >>>>  1 file changed, 44 insertions(+), 1 deletion(-)
> >>>>
> >>>> I'm by no means a raid5 expert but this is what I was able to come up
> >>>> with. Any opinion or ideas on how to fix the problem in a better way?
> >>>
> >>> The other option would be to revert to the old method for spinning disks
> >>> and use the pivot option only on SSDs. The pivot optimization really
> >>> only applies at IO speeds that can be achieved by fast solid state disks
> >>> anyway, and there aren't likely enough IOPS possible on spinning disks
> >>> to get enough lock contention that the optimization would provide any
> >>> benefit.
> >>>
> >>> So it could make sense to just fall back to the old method of preparing
> >>> the stripes in logical block order if we are running on spinning disks.
> >>> Though, that might be a bit more involved than what this patch does. So
> >>> I think this is probably a good approach, unless we want to recover more
> >>> of the lost throughput.
> >>
> >> How about we only do the optimization in this patch for spinning disks?
> >> Something like:
> >>
> >>     if (likely(conf->reshape_progress == MaxSector) &&
> >>         !blk_queue_nonrot(mddev->queue))
> >>             logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
> >
> > Yeah, makes sense. On SSD disks I was not able to observe any adverse
> > effects from the different stripe order. Will you update the patch or
> > should I respin it?
>
> Does it make sense? If SSDs work fine with the new stripe order, why do
> things differently for them? So I don't see a benefit of making the fix
> only work with non-rotational devices.
> It's my original change which could be made for non-rotational only,
> but that's much more involved.

I am hoping to make raid5_make_request() a little faster for
non-rotational devices. We may not easily observe a difference in
performance, but things add up. Does this make sense?

Thanks,
Song