On Tue, 4 Jun 2013 20:54:28 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
wrote:

> Hello Neil,
>
> On Tue, Jun 4, 2013 at 2:52 AM, NeilBrown <neilb@xxxxxxx> wrote:
> > On Tue, 28 May 2013 15:46:33 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> > wrote:
> >
> >> Hello Neil,
> >> we continue testing last-drive RAID1 failure cases.
> >> We see the following issue:
> >>
> >> # RAID1 with drives A and B; drive B was freshly added and is rebuilding
> >> # Drive A fails
> >> # A WRITE request arrives at the array. It is failed by drive A, so
> >> r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
> >> succeeds in writing it, so the same r1_bio is marked as
> >> R1BIO_Uptodate.
> >> # r1_bio arrives at handle_write_finished; badblocks are disabled, and
> >> md_error()->error() does nothing, because we don't fail the last
> >> drive of a raid1
> >> # raid_end_bio_io() calls call_bio_endio()
> >> # As a result, in call_bio_endio():
> >>         if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> >>                 clear_bit(BIO_UPTODATE, &bio->bi_flags);
> >> this code doesn't clear the BIO_UPTODATE flag, and the whole master
> >> WRITE succeeds, back to the upper layer.
> >>
> >> # This keeps happening until the rebuild aborts and drive B is ejected
> >> from the array[1]. After that, there is only one drive (A), so after
> >> it fails a WRITE, the master WRITE also fails.
> >>
> >> It should be noted that I test a WRITE that is way ahead of the
> >> recovery_offset of drive B. So after such a WRITE fails, a subsequent
> >> READ of the same place would also fail, because drive A will fail it,
> >> and drive B cannot be used for the READ (the rebuild has not reached
> >> that area yet).
> >>
> >> My concrete suggestion is that this behavior is not reasonable, and we
> >> should only count a WRITE as successful if it completed on a drive that
> >> is marked as In_sync. Please let me know what you think.
> >
> > Sounds reasonable. Could you make and test a patch? Then I'll apply it.
>
> I read all the code of raid1.c, and it seems that we need to fix only
> one place. With this fix, I don't see corruption after I fail and then
> revive the last drive of a raid1. I also attach the patch, in case Gmail
> wraps it.
>
> Also, with this fix, the rebuilding drive is ejected almost
> immediately, so the second part of the problem doesn't happen now,
> although I will consider your suggestion for it later.
>
>
> >From 8d25ada35b5e1ec6ea4c6da6e571ec691b34086a Mon Sep 17 00:00:00 2001
> From: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx>
> Date: Tue, 4 Jun 2013 20:42:21 +0300
> Subject: [PATCH] md/raid1: consider WRITE as successful only if at least one
>  non-Faulty and non-rebuilding drive completed it.
>
> Without this fix, the following scenario could happen:
>
> - RAID1 with drives A and B; drive B was freshly added and is rebuilding
> - Drive A fails
> - A WRITE request arrives at the array. It is failed by drive A, so
> r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
> succeeds in writing it, so the same r1_bio is marked as
> R1BIO_Uptodate.
> - r1_bio arrives at handle_write_finished; badblocks are disabled, and
> md_error()->error() does nothing, because we don't fail the last
> drive of a raid1
> - raid_end_bio_io() calls call_bio_endio()
> - As a result, in call_bio_endio():
>         if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>                 clear_bit(BIO_UPTODATE, &bio->bi_flags);
> this code doesn't clear the BIO_UPTODATE flag, and the whole master
> WRITE succeeds, back to the upper layer.
>
> So we returned success to the upper layer, even though we had written
> the data onto the rebuilding drive only. But when we want to read the
> data back, we would not read from the rebuilding drive, so this data
> is lost.
>
> Signed-off-by: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx>
> ---
>  drivers/md/raid1.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 6af167f..d8d159e 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -427,7 +427,17 @@ static void raid1_end_write_request(struct bio *bio, int error)
>
>               r1_bio->bios[mirror] = NULL;
>               to_put = bio;
> -             set_bit(R1BIO_Uptodate, &r1_bio->state);
> +             /*
> +              * Do not set R1BIO_Uptodate if the current device is
> +              * rebuilding or Faulty. This is because we cannot use
> +              * such a device for properly reading the data back (we
> +              * could potentially use it, if the current write had
> +              * fallen before rdev->recovery_offset, but for simplicity
> +              * we don't check this here).
> +              */
> +             if (test_bit(In_sync, &conf->mirrors[mirror].rdev->flags) &&
> +                 !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags))
> +                     set_bit(R1BIO_Uptodate, &r1_bio->state);
>
>               /* Maybe we can clear some bad blocks. */
>               if (is_badblock(conf->mirrors[mirror].rdev,

Hi Alex,
thanks for the excellent analysis and the elegant patch. I've applied it
and created a similar patch for raid10, which has the same problem. I'll
send them upstream sometime this week.

Thanks,
NeilBrown
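The raid10 counterpart mentioned above is not included in this thread. A
minimal sketch of what it would presumably look like, assuming that
raid10_end_write_request() mirrors the raid1 structure and already has the
completing device's md_rdev resolved as rdev:

		/*
		 * As in raid1: only mark the master WRITE as successful if
		 * the device that completed it is In_sync and not Faulty,
		 * since a rebuilding or failed device cannot yet be used to
		 * read the data back.
		 */
		if (test_bit(In_sync, &rdev->flags) &&
		    !test_bit(Faulty, &rdev->flags))
			set_bit(R10BIO_Uptodate, &r10_bio->state);

In_sync, Faulty and R10BIO_Uptodate are existing md/raid10 flag names; the
exact placement of this check inside raid10_end_write_request() is an
assumption here, not taken from Neil's actual patch.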