Re: Sequential writing to degraded RAID6 causing a lot of reading

Patrik Horník <patrik@xxxxxx> · Tue, 20 May 2014 12:07:11 +0200

2014-05-20 7:42 GMT+02:00 NeilBrown <neilb@xxxxxxx>:
> On Thu, 15 May 2014 09:50:49 +0200 Patrik Horník <patrik@xxxxxx> wrote:
>
>> OK, it seems that because of that my copy operations will not be
>> finished yet by next week... :)
>>
>> BTW this time the layout is left-symmetric, but I guess the problem is in
>> whole-stripe write detection with degraded RAID6.
>>
>> Patrik
>>
>> 2014-05-15 9:18 GMT+02:00 NeilBrown <neilb@xxxxxxx>:
>> > On Thu, 15 May 2014 09:04:27 +0200 Patrik Horník <patrik@xxxxxx> wrote:
>> >
>> >> Hello Neil,
>> >>
>> >> did you make some progress on this issue by any chance?
>> >
>> > No I haven't - sorry.
>> > After 2 years, I guess I really should.
>> >
>> > I'll make another note for first thing next week.
>
> Can you try the following patch and let me know if it helps?

I don't want to test it on a production system... But I have a
degraded array which does not have production data on it, so I will
think about how to test it.

> I definitely reduced the number of reads significantly, but my measurements
> (of a very simple test case) didn't show much speed-up.
>

I did not look at the patch itself, but according to your description
it should eliminate the problem, should it not? What was your
read/write ratio after the patch?

Thanks.

Patrik

> This is against current mainline.  If you want it against another version and
> it doesn't apply easily, just ask.
>
> Thanks,
> NeilBrown
>
> From 98c411f93391be0dbda98d43835dd9e042faa78f Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@xxxxxxx>
> Date: Mon, 19 May 2014 11:16:49 +1000
> Subject: [PATCH] md/raid56: Don't perform reads to support writes until stripe
>  is ready.
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> If it is found that we need to pre-read some blocks before a write
> can succeed, we normally set STRIPE_DELAYED and don't actually perform
> the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
>
> However for a degraded RAID6 we currently perform the reads as soon
> as we see that a write is pending.  This significantly hurts
> throughput.
>
> So:
>  - when handle_stripe_dirtying finds a block that it wants on a device
>    that is failed, set STRIPE_DELAYED, instead of doing nothing, and
>  - when fetch_block detects that a read might be required to satisfy a
>    write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
>    and if we would actually need to read something to complete the write.
>
> This also helps RAID5, though less often as RAID5 supports a
> read-modify-write cycle.  For RAID5 the read is performed too early
> only if the write is not a full 4K aligned write (i.e. not an
> R5_OVERWRITE).
>
> Also clean up a couple of horrible bits of formatting.
>
> Reported-by: Patrik Horník <patrik@xxxxxx>
> Signed-off-by: NeilBrown <neilb@xxxxxxx>
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 633e20a96b34..d67202bd9118 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -292,9 +292,12 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
>         BUG_ON(atomic_read(&conf->active_stripes)==0);
>         if (test_bit(STRIPE_HANDLE, &sh->state)) {
>                 if (test_bit(STRIPE_DELAYED, &sh->state) &&
> -                   !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +                   !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
>                         list_add_tail(&sh->lru, &conf->delayed_list);
> -               else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> +                       if (atomic_read(&conf->preread_active_stripes)
> +                           < IO_THRESHOLD)
> +                               md_wakeup_thread(conf->mddev->thread);
> +               } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
>                            sh->bm_seq - conf->seq_write > 0)
>                         list_add_tail(&sh->lru, &conf->bitmap_list);
>                 else {
> @@ -2908,8 +2911,11 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
>              (s->failed >= 1 && fdev[0]->toread) ||
>              (s->failed >= 2 && fdev[1]->toread) ||
>              (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
> +             (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&
>               !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
> -            (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
> +            (sh->raid_conf->level == 6 && s->failed && s->to_write &&
> +             s->to_write < sh->raid_conf->raid_disks - 2 &&
> +             (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))))) {
>                 /* we would like to get this block, possibly by computing it,
>                  * otherwise read it if the backing disk is insync
>                  */
> @@ -3115,7 +3121,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                     !test_bit(R5_LOCKED, &dev->flags) &&
>                     !(test_bit(R5_UPTODATE, &dev->flags) ||
>                     test_bit(R5_Wantcompute, &dev->flags))) {
> -                       if (test_bit(R5_Insync, &dev->flags)) rcw++;
> +                       if (test_bit(R5_Insync, &dev->flags))
> +                               rcw++;
>                         else
>                                 rcw += 2*disks;
>                 }
> @@ -3136,10 +3143,10 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                             !(test_bit(R5_UPTODATE, &dev->flags) ||
>                             test_bit(R5_Wantcompute, &dev->flags)) &&
>                             test_bit(R5_Insync, &dev->flags)) {
> -                               if (
> -                                 test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> -                                       pr_debug("Read_old block "
> -                                                "%d for r-m-w\n", i);
> +                               if (test_bit(STRIPE_PREREAD_ACTIVE,
> +                                            &sh->state)) {
> +                                       pr_debug("Read_old block %d for r-m-w\n",
> +                                                i);
>                                         set_bit(R5_LOCKED, &dev->flags);
>                                         set_bit(R5_Wantread, &dev->flags);
>                                         s->locked++;
> @@ -3162,10 +3169,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                             !(test_bit(R5_UPTODATE, &dev->flags) ||
>                               test_bit(R5_Wantcompute, &dev->flags))) {
>                                 rcw++;
> -                               if (!test_bit(R5_Insync, &dev->flags))
> -                                       continue; /* it's a failed drive */
> -                               if (
> -                                 test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> +                               if (test_bit(R5_Insync, &dev->flags) &&
> +                                   test_bit(STRIPE_PREREAD_ACTIVE,
> +                                            &sh->state)) {
>                                         pr_debug("Read_old block "
>                                                 "%d for Reconstruct\n", i);
>                                         set_bit(R5_LOCKED, &dev->flags);
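
To check my reading of the description, here is a minimal standalone sketch of
the RAID6 clause in fetch_block before and after the patch. It is not kernel
code: struct stripe_model and both function names are hypothetical, the fields
only mirror s->failed, s->to_write, raid_disks and the STRIPE_PREREAD_ACTIVE
bit, and the other clauses of the real condition (toread, R5_OVERWRITE, the
RAID5 case) are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's per-stripe state. */
struct stripe_model {
	int raid_disks;       /* all devices in the array, including P and Q */
	int failed;           /* failed devices seen by this stripe */
	int to_write;         /* blocks in the stripe with a pending write */
	bool preread_active;  /* models the STRIPE_PREREAD_ACTIVE bit */
};

/* Before the patch: any pending write to a degraded RAID6 stripe
 * was enough to make fetch_block want the block. */
static bool raid6_fetch_old(const struct stripe_model *s)
{
	return s->failed && s->to_write;
}

/* After the patch, as I understand it: the block is wanted only if
 *  - the write does not already cover every data block, and
 *  - the device is not in sync (so the block must be computed), or
 *    the stripe has been promoted to "preread active". */
static bool raid6_fetch_new(const struct stripe_model *s, bool dev_insync)
{
	assert(s->raid_disks > 2); /* RAID6 needs P, Q and some data */

	return s->failed && s->to_write &&
	       s->to_write < s->raid_disks - 2 &&
	       (!dev_insync || s->preread_active);
}
```

In this model, a 6-device array with one failed device and a write covering
all 4 data blocks asked for reads before the patch and does not after it. A
partial write to an in-sync device waits until the stripe becomes preread
active.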