Hello Neil, On Tue, Jun 4, 2013 at 2:52 AM, NeilBrown <neilb@xxxxxxx> wrote: > On Tue, 28 May 2013 15:46:33 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx> > wrote: > >> Hello Neil, >> we continue testing last-drive RAID1 failure cases. >> We see the following issue: >> >> # RAID1 with drives A and B; drive B was freshly-added and is rebuilding >> # Drive A fails >> # WRITE request arrives to the array. It is failed by drive A, so >> r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B >> succeeds in writing it, so the same r1_bio is marked as >> R1BIO_Uptodate. >> # r1_bio arrives to handle_write_finished, badblocks are disabled, >> md_error()->error() does nothing because we don't fail the last drive >> of raid1 >> # raid_end_bio_io() calls call_bio_endio() >> # As a result, in call_bio_endio(): >> if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) >> clear_bit(BIO_UPTODATE, &bio->bi_flags); >> this code doesn't clear the BIO_UPTODATE flag, and the whole master >> WRITE succeeds, back to the upper layer. >> >> # This keeps happening until rebuild aborts, and drive B is ejected >> from the array[1]. After that, there is only one drive (A), so after >> it fails a WRITE, the master WRITE also fails. >> >> It should be noted, that I test a WRITE that is way ahead of >> recovery_offset of drive B. So after such WRITE fails, subsequent READ >> to the same place would fail, because drive A will fail it, and drive >> B cannot be attempted to READ from there (rebuild has not reached >> there yet). >> >> My concrete suggestion is that this behavior is not reasonable, and we >> should only count a successful WRITE to a drive that is marked as >> InSync. Please let me know what do you think? > > Sounds reasonable. Could you make and test a patch? Then I'll apply it. I read all the code of raid1.c, and it seems that we need to fix only one place. With this fix, I don't see corruption after I fail and then revive the last drive of raid1. I also attach the patch, in case Gmail wraps it. Also, with this fix, the rebuilding drive is ejected almost immediately, so the second part of the problem doesn't happen now. Although I will consider your suggestion for it later. >From 8d25ada35b5e1ec6ea4c6da6e571ec691b34086a Mon Sep 17 00:00:00 2001 From: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx> Date: Tue, 4 Jun 2013 20:42:21 +0300 Subject: [PATCH] md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. Without that fix, the following scenario could happen: - RAID1 with drives A and B; drive B was freshly-added and is rebuilding - Drive A fails - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate. - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1 - raid_end_bio_io() calls call_bio_endio() - As a result, in call_bio_endio(): if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) clear_bit(BIO_UPTODATE, &bio->bi_flags); this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer. So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. Signed-off-by: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx> --- drivers/md/raid1.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 6af167f..d8d159e 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -427,7 +427,17 @@ static void raid1_end_write_request(struct bio *bio, int error) r1_bio->bios[mirror] = NULL; to_put = bio; - set_bit(R1BIO_Uptodate, &r1_bio->state); + /* + * Do not set R1BIO_Uptodate if the current device is + * rebuilding or Faulty. This is because we cannot use + * such device for properly reading the data back (we could + * potentially use it, if the current write would have felt + * before rdev->recovery_offset, but for simplicity we don't + * check this here. + */ + if (test_bit(In_sync, &conf->mirrors[mirror].rdev->flags) && + !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)) + set_bit(R1BIO_Uptodate, &r1_bio->state); /* Maybe we can clear some bad blocks. */ if (is_badblock(conf->mirrors[mirror].rdev, -- 1.7.9.5
Attachment:
0001-md-raid1-consider-WRITE-as-successful-only-if-at-lea.patch
Description: Binary data