Re: raid1_end_read_request does not retry failed READ from a recovering drive

Hi Neil,
Following up on our discussion: I see that you have applied a commit[1],
whose diff I have quoted below[2].
However, that is only the second part of the fix. I believe we also
need the first part, in which raid1_end_read_request() replaces
"!test_bit(Faulty)" with "test_bit(In_sync)". Otherwise we will never
retry the READ, and thus will never reach the fix_read_error() code.
We also need to "put all of raid1_spare_active inside the
spinlock", as you advised. A sketch of both changes is below.
Do you agree?
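
To make sure we mean the same thing, here is a rough sketch of the
first part (not an actual patch: the raid1_end_read_request() hunk is
based on the code quoted further down in this thread, and the
raid1_spare_active() hunk abbreviates the loop body and quotes its
surrounding context from memory):

--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ static void raid1_end_read_request(struct bio *bio, int error)
 		spin_lock_irqsave(&conf->device_lock, flags);
 		if (r1_bio->mddev->degraded == conf->raid_disks ||
 		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
-		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
+		     test_bit(In_sync, &conf->mirrors[mirror].rdev->flags)))
 			uptodate = 1;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
@@ static int raid1_spare_active(struct mddev *mddev)
+	spin_lock_irqsave(&conf->device_lock, flags);
 	for (i = 0; i < conf->raid_disks; i++) {
 		/* ... loop body unchanged ... */
 	}
-	spin_lock_irqsave(&conf->device_lock, flags);
 	mddev->degraded -= count;
 	spin_unlock_irqrestore(&conf->device_lock, flags);

With the first hunk, a failed READ on the recovering drive B (which is
!Faulty but also !In_sync) is no longer treated as coming from the last
good device, so the r1_bio goes onto retry_list and fix_read_error()
gets a chance to run; the second hunk closes the race you mentioned
between setting In_sync and updating ->degraded.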

Note that we tested both parts of the fix, not the second part alone.
When I wrote "With this addition, the problem appears to be fixed", I
meant that we had applied both parts.

I am sorry for catching this so late. I had not looked at what you
applied until now; we are moving to kernel 3.18 long-term, so I am
reviewing the raid1 changes, and I had simply assumed that you applied
both parts.

If you agree, it would be good to tag the first part of the fix with
"Cc: stable" as well.

Thanks,
Alex.



[1]
commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd
Author: NeilBrown <neilb@xxxxxxx>
Date:   Thu Sep 18 11:09:04 2014 +1000

    md/raid1: fix_read_error should act on all non-faulty devices.

    If a device is being recovered it is not InSync and is not Faulty.

    If a read error is experienced on that device, fix_read_error()
    will be called, but it ignores non-InSync devices.  So it will
    neither fix the error nor fail the device.

    It is incorrect that fix_read_error() ignores non-InSync devices.
    It should only ignore Faulty devices.  So fix it.

    This became a bug when we allowed reading from a device that was being
    recovered.  It is suitable for any subsequent -stable kernel.

    Fixes: da8840a747c0dbf49506ec906757a6b87b9741e9
    Cc: stable@xxxxxxxxxxxxxxx (v3.5+)
    Reported-by: Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
    Tested-by: Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
    Signed-off-by: NeilBrown <neilb@xxxxxxx>

[2]
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 35649dd..55de4f6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2155,7 +2155,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
                        d--;
                        rdev = conf->mirrors[d].rdev;
                        if (rdev &&
-                           test_bit(In_sync, &rdev->flags))
+                           !test_bit(Faulty, &rdev->flags))
                                r1_sync_page_io(rdev, sect, s,
                                                conf->tmppage, WRITE);
                }
@@ -2167,7 +2167,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
                        d--;
                        rdev = conf->mirrors[d].rdev;
                        if (rdev &&
-                           test_bit(In_sync, &rdev->flags)) {
+                           !test_bit(Faulty, &rdev->flags)) {
                                if (r1_sync_page_io(rdev, sect, s,
                                                    conf->tmppage, READ)) {
                                        atomic_add(s, &rdev->corrected_errors);

On Mon, Sep 22, 2014 at 2:17 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Sun, 21 Sep 2014 19:47:14 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Thanks, Neil,
>>
>> On Thu, Sep 18, 2014 at 4:05 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Wed, 17 Sep 2014 20:57:13 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> > wrote:
>> >
>> >> Hi Neil,
>> >>
>> >> On Mon, Sep 8, 2014 at 10:17 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> >> > On Sun, 7 Sep 2014 17:18:16 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> >> > wrote:
>> >> >
>> >> >> Hi Neil,
>> >> >> we see the following issue:
>> >> >>
>> >> >> # RAID1 has 2 drives A and B, drive B is recovering
>> >> >> # READ request arrives
>> >> >> # read_balance selects drive B to read from, because the READ sector
>> >> >> comes before B->recovery_offset
>> >> >> # READ is issued to drive B, but fails (drive B fails again)
>> >> >>
>> >> >> Now raid1_end_read_request() has the following code:
>> >> >>
>> >> >>     if (uptodate)
>> >> >>         set_bit(R1BIO_Uptodate, &r1_bio->state);
>> >> >>     else {
>> >> >>         /* If all other devices have failed, we want to return
>> >> >>          * the error upwards rather than fail the last device.
>> >> >>          * Here we redefine "uptodate" to mean "Don't want to retry"
>> >> >>          */
>> >> >>         unsigned long flags;
>> >> >>         spin_lock_irqsave(&conf->device_lock, flags);
>> >> >>         if (r1_bio->mddev->degraded == conf->raid_disks ||
>> >> >>             (r1_bio->mddev->degraded == conf->raid_disks-1 &&
>> >> >>              !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
>> >> >>             uptodate = 1;
>> >> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
>> >> >>     }
>> >> >>
>> >> >> According to this code uptodate wrongly becomes 1, because:
>> >> >> r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
>> >> >> and
>> >> >> !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE
>> >> >>
>> >> >> Indeed, drive B is not marked Faulty, but it is also not marked In_sync.
>> >> >> However, this function treats !Faulty as equivalent to In_sync, so it
>> >> >> decides that the last good drive has failed and does not retry the
>> >> >> READ.
>> >> >>
>> >> >> As a result, there is IO error, while we should have retried the READ
>> >> >> from the healthy drive.
>> >> >>
>> >> >> This is happening in 3.8.13, but your master branch seems to have the
>> >> >> same issue.
>> >> >>
>> >> >> What is a reasonable fix?
>> >> >> 1) Do not read from drives which are !In_sync (a bit scary to read
>> >> >> from such a drive)
>> >> >
>> >> > It is perfectly safe to read from a !In_sync device providing you are before
>> >> > ->recovery_offset.
>> >> >
>> >> >
>> >> >> 2) replace !Faulty to In_sync check
>> >> >
>> >> > That probably makes sense... though that could race with raid1_spare_active().
>> >> > If a read-error returned just after raid1_spare_active() set In_sync, and
>> >> > before 'count' was subtracted from ->degraded, we would still set uptodate
>> >> > when we shouldn't.
>> >> > It probably makes sense to put all of raid1_spare_active inside the spinlock -
>> >> > it doesn't get called often enough for performance to be an issue (I hope).
>> >> >
>> >> > So:
>> >> >  1/ change !Faulty to In_sync
>> >> >  2/ extend the spinlock in raid1_spare_active to cover the whole function.
>> >>
>> >>
>> >> I made these fixes and reproduced the issue. However, the result is
>> >> not what we expected:
>> >>
>> >> # raid1_end_read_request() now indeed adds the r1_bio into retry_list,
>> >> as we wanted
>> >> # raid1d calls fix_read_error()
>> >> # fix_read_error() searches for an In_sync drive to read the data
>> >> from. It finds such drive (this is our good drive A)
>> >> # now fix_read_error() wants to rewrite the bad area. But it rewrites
>> >> only to those drives that are In_sync (except the drive it got the
>> >> data from). In our case, it never tries to rewrite the data on drive B
>> >> (drive B is marked neither Faulty nor In_sync). As a result,
>> >> md_error() is not called, so drive B is still not marked Faulty
>> >> when fix_read_error() completes
>> >> # so handle_read_error() retries the original READ by calling
>> >> read_balance(), which again in my case selects the recovering drive
>> >> B...
>> >>
>> >> And then the whole flow repeats itself again and again...and READ
>> >> never completes.
>> >>
>> >> Maybe we should not allow selecting recovering drives for READ? Or
>> >> some other approach?
>> >>
>> >
>> > Thanks for the testing and analysis.
>> > Presumably we just want fix_read_error() to write to all non-faulty
>> > devices, not just the InSync ones.
>> > i.e. the following patch.
>> >
>> > Thanks,
>> > NeilBrown
>> >
>> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> > index 6a9c73435eb8..a95f9e179e6f 100644
>> > --- a/drivers/md/raid1.c
>> > +++ b/drivers/md/raid1.c
>> > @@ -2153,7 +2153,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
>> >                         d--;
>> >                         rdev = conf->mirrors[d].rdev;
>> >                         if (rdev &&
>> > -                           test_bit(In_sync, &rdev->flags))
>> > +                           !test_bit(Faulty, &rdev->flags))
>> >                                 r1_sync_page_io(rdev, sect, s,
>> >                                                 conf->tmppage, WRITE);
>> >                 }
>> > @@ -2165,7 +2165,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
>> >                         d--;
>> >                         rdev = conf->mirrors[d].rdev;
>> >                         if (rdev &&
>> > -                           test_bit(In_sync, &rdev->flags)) {
>> > +                           !test_bit(Faulty, &rdev->flags)) {
>> >                                 if (r1_sync_page_io(rdev, sect, s,
>> >                                                     conf->tmppage, READ)) {
>> >                                         atomic_add(s, &rdev->corrected_errors);
>>
>> With this addition, the problem appears to be fixed. We will give it
>> some regression testing & will let you know if we see any issues.
>>
>> I presume you will be applying this fix upstream as well.
>
> Yes, it is already in my for-next branch.
> I've just added your tested-by.
>
> Thanks,
> NeilBrown