On Thu, 19 Mar 2015 13:41:16 -0600 Eric Mei <meijia@xxxxxxxxx> wrote:

> On 2015-03-19 12:02 AM, NeilBrown wrote:
> > On Wed, 18 Mar 2015 23:39:11 -0600 Eric Mei <meijia@xxxxxxxxx> wrote:
> >
> >> From: Eric Mei <eric.mei@xxxxxxxxxxx>
> >>
> >> When the array is degraded, reads that land on failed drives result
> >> in reading the rest of the data in the stripe, so a single sequential
> >> read ends up reading the same data twice.
> >>
> >> This patch avoids chunk-aligned reads for degraded arrays. The
> >> downside is that it involves the stripe cache, which means associated
> >> CPU overhead and an extra memory copy.
> >>
> >> Signed-off-by: Eric Mei <eric.mei@xxxxxxxxxxx>
> >> ---
> >>  drivers/md/raid5.c | 15 ++++++++++++---
> >>  1 files changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index cd2f96b..763c64a 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -4180,8 +4180,12 @@ static int raid5_mergeable_bvec(struct mddev *mddev,
> >>      unsigned int chunk_sectors = mddev->chunk_sectors;
> >>      unsigned int bio_sectors = bvm->bi_size >> 9;
> >>
> >> -    if ((bvm->bi_rw & 1) == WRITE)
> >> -        return biovec->bv_len; /* always allow writes to be mergeable */
> >> +    /*
> >> +     * always allow writes to be mergeable, read as well if array
> >> +     * is degraded as we'll go through stripe cache anyway.
> >> +     */
> >> +    if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
> >> +        return biovec->bv_len;
> >>
> >>      if (mddev->new_chunk_sectors < mddev->chunk_sectors)
> >>          chunk_sectors = mddev->new_chunk_sectors;
> >> @@ -4656,7 +4660,12 @@ static void make_request(struct mddev *mddev, struct bio * bi)
> >>
> >>      md_write_start(mddev, bi);
> >>
> >> -    if (rw == READ &&
> >> +    /*
> >> +     * If array is degraded, better not do chunk aligned read because
> >> +     * later we might have to read it again in order to reconstruct
> >> +     * data on failed drives.
> >> +     */
> >> +    if (rw == READ && mddev->degraded == 0 &&
> >>          mddev->reshape_position == MaxSector &&
> >>          chunk_aligned_read(mddev,bi))
> >>          return;
> >
> > Thanks for the patch.
> >
> > However this sort of patch really needs to come with some concrete
> > performance numbers, preferably for both sequential reads and random
> > reads.
> >
> > I agree that sequential reads are likely to be faster, but how much
> > faster are they?
> > I imagine that this might make random reads a little slower. Does it?
> > By how much?
> >
> > Thanks,
> > NeilBrown
>
> Hi Neil,
>
> Sorry, I should have done the tests in the first place.
>
> The following tests were done on an enterprise storage node with
> Seagate 6 TB SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9 GHz),
> using a 10-disk MD RAID6 (8+2) with a 128 KiB chunk size.
>
> I used FIO with direct I/O, various block sizes, and enough queue
> depth, and tested sequential and 100% random reads against three array
> configurations: 1) optimal, as the baseline; 2) degraded; 3) degraded
> with this patch. The kernel version is 4.0-rc3.
>
> Each individual test was run only once, so there may be some variation,
> but the focus here is the big trend.
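>
> A job of roughly this shape captures the setup (an illustrative sketch,
> not the exact job file; the device name, iodepth, and runtime below are
> placeholders):
>
>   [seqread]
>   ioengine=libaio
>   direct=1
>   rw=read
>   bs=128k
>   iodepth=32
>   runtime=60
>   time_based
>   filename=/dev/md0
>
> The random-read jobs differ only in rw=randread and the bs value.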
>
> Sequential Read:
>
> bs (KiB)   optimal (MiB/s)   degraded (MiB/s)   degraded-with-patch (MiB/s)
> 1024       1608              656                995
> 512        1624              710                956
> 256        1635              728                980
> 128        1636              771                983
> 64         1612              1119               1000
> 32         1580              1420               1004
> 16         1368              688                986
> 8          768               647                953
> 4          411               413                850
>
> Random Read:
>
> bs (KiB)   optimal (IOPS)   degraded (IOPS)   degraded-with-patch (IOPS)
> 1024       163              160               156
> 512        274              273               272
> 256        426              428               424
> 128        576              592               591
> 64         726              724               726
> 32         849              848               837
> 16         900              970               971
> 8          927              940               929
> 4          948              940               955
>
> Some notes:
> * In sequential + optimal, as bs gets smaller, the FIO thread becomes
>   CPU bound.
> * In sequential + degraded, there is a big jump when bs is 64 KiB and
>   32 KiB; I don't have an explanation for that.
> * In sequential + degraded-with-patch, the MD thread mostly becomes
>   CPU bound.
>
> If you want, we can discuss specific data points in those results. But
> in general it seems that with this patch we get more predictable and in
> most cases significantly better sequential read performance when the
> array is degraded, and almost no noticeable impact on random reads.
>
> Performance is a complicated thing; the patch works well for this
> particular configuration but may not be universal. For example, I
> imagine testing on an all-SSD array could give very different results.
> But I personally think that in most cases I/O bandwidth is a scarcer
> resource than CPU.
>
> Eric

Thanks. That is reasonably convincing.

I've added that text to the commit message, fixed up all the whitespace
damage in the patch (tabs were converted to spaces, etc. ... if you are
going to be sending more patches, please find a way to convince your
mailer that spaces are important), and applied it. It should be included
in my pull request for 4.1.

Thanks,
NeilBrown
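As a footnote to the discussion above, the "read twice" effect that the
commit message describes can be made concrete with a toy calculation.
The standalone C sketch below models a full-stripe sequential read on
the 8+2 RAID6 from the test setup, with one failed data drive. It is a
simplification (the stripe cache actually works in page-sized units,
and this is not kernel code), and the numbers are illustrative
assumptions, not measurements from the thread.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed layout, matching the test setup in the thread:
         * 8+2 RAID6, 128 KiB chunks, one failed data drive, and a
         * sequential read covering one full stripe of user data.
         */
        const unsigned chunk_kib  = 128;
        const unsigned data_disks = 8;  /* data chunks per stripe    */
        const unsigned surviving  = 7;  /* one data drive has failed */

        unsigned wanted = data_disks * chunk_kib;

        /* Without the patch: the 7 surviving data chunks are served
         * by chunk-aligned reads, then reconstructing the missing
         * chunk re-reads those 7 chunks plus the P parity chunk.
         */
        unsigned aligned = surviving * chunk_kib
                         + (surviving + 1) * chunk_kib;

        /* With the patch: the whole read goes through the stripe
         * cache, so each surviving data chunk and the P parity chunk
         * are read from disk exactly once.
         */
        unsigned cached = (surviving + 1) * chunk_kib;

        printf("user data requested:      %4u KiB\n", wanted);
        printf("disk reads, aligned path: %4u KiB\n", aligned);
        printf("disk reads, stripe cache: %4u KiB\n", cached);
        return 0;
    }

Built and run with "cc model.c && ./a.out", it reports 1024 KiB
requested, 1920 KiB read from disk via the chunk-aligned path, and
1024 KiB via the stripe cache: roughly the doubling the commit message
describes, and consistent in direction with the degraded vs.
degraded-with-patch sequential numbers above.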