Re: raid1 repair: sync_request() aborts if one of the drives has bad block recorded

Alexander Lyakas <alex.bolshoy@xxxxxxxxx> · Tue, 17 Jul 2012 16:17:55 +0300

Thanks for your comments, I got confused with the REQUESTED bit.
I prepared the patch, with couple of notes:

1/ I decided to be more careful and schedule a write only in case of
resync or repair. I was not sure whether we should try to correct bad
blocks on device X, when device Y is recovering. Pls change it if you
feel otherwise.

2/ I tested and committed the patch on top of ubuntu-precise 3.2.0-25.
I looked at your "for-next" branch, and saw that there is some new
code, which handles hot-replace, which I am not familiar with at this
point.

>From 2cf85226e76ab7f5d3c2bfdb207244cd9ed7ae19 Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex.bolshoy@xxxxxxxxx>
Date: Tue, 17 Jul 2012 15:47:35 +0300
Subject: [PATCH] RAID1: When doing resync or repair, attempt to correct bad
 blocks, according to WriteErrorSeen policy

Signed-off-by: Alex Lyakas <alex.bolshoy@xxxxxxxxx>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7af60ec..62e4f44 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2257,6 +2257,18 @@ static sector_t sync_request(struct mddev
*mddev, sector_t sector_nr, int *skipp
 				bio->bi_rw = READ;
 				bio->bi_end_io = end_sync_read;
 				read_targets++;
+			} else if (!test_bit(WriteErrorSeen, &rdev->flags) &&
+				test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
+				!test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
+				/*
+				 * The device is suitable for reading (InSync),
+				 * but has bad block(s) here. Let's try to correct them,
+				 * if we are doing resync or repair. Otherwise, leave
+				 * this device alone for this sync request.
+				 */
+				bio->bi_rw = WRITE;
+				bio->bi_end_io = end_sync_write;
+				write_targets++;
 			}
 		}
 		if (bio->bi_end_io) {
=================================

Final note: I noticed that badblocks_show() fails if there are too
many bad blocks. It returns value larger than PAGE_SIZE, and then the
following linux code complains:
fs/sysfs/file.c:fill_read_buffer()
	/*
	 * The code works fine with PAGE_SIZE return but it's likely to
	 * indicate truncated result or overflow in normal use cases.
	 */
	if (count >= (ssize_t)PAGE_SIZE) {
		print_symbol("fill_read_buffer: %s returned bad count\n",
			(unsigned long)ops->show);
		/* Try to struggle along */
		count = PAGE_SIZE - 1;
	}

So I am not sure how to solve it, but it would be good for
user/application to receive the full list of bad blocks. Perhaps
application can pass fd via some ioctl (I feel you don't like ioctls),
and then kernel can use vfs_write() to print all the bad blocks to the
fd. Or simply return the bad blocks list through the ioctl output to
mdadm, and mdadm would print them. Perhaps some other way.

Thanks for your help!
Alex.


On Tue, Jul 17, 2012 at 4:17 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Mon, 16 Jul 2012 11:45:12 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Hi Neil,
>> thanks for your comments.
>>
>> > I don't see that MD_RECOVERY_REQUESTED is really an issue is it?
>> > We the wrong thing happen in any case where there just one device with no bad
>> > block, and at least one device with a bad block.  In that case we want to
>> > skip forward by the number of bad blocks, not skip to the end of the array.
>>
>> I think that MD_RECOVERY_REQUESTED has some meaning here. Because
>> there are only two places where we increase "write_targets":
>> Here:
>>               } else if (!test_bit(In_sync, &rdev->flags)) {
>>                       bio->bi_rw = WRITE;
>>                       bio->bi_end_io = end_sync_write;
>>                       write_targets ++;
>> and
>>       if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && read_targets > 0)
>>               /* extra read targets are also write targets */
>>               write_targets += read_targets-1;
>>
>> So if we'll see a device that is not In_sync, we will increase
>> write_targets and won't fall into that issue. That's why I was
>> thinking to check for MD_RECOVERY_REQUESTED.
>
> But neither of the code segements you have identified depend on
> MD_RECOVERY_REQUESTED.
>
> The problem occurs with resync/repair/check but not with recover.
>
>
>>
>> >
>> > So this?
>> >
>> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> > index 240ff31..c443f93 100644
>> > --- a/drivers/md/raid1.c
>> > +++ b/drivers/md/raid1.c
>> > @@ -2422,7 +2422,10 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
>> >                 /* There is nowhere to write, so all non-sync
>> >                  * drives must be failed - so we are finished
>> >                  */
>> > -               sector_t rv = max_sector - sector_nr;
>> > +               sector_t rv;
>> > +               if (min_bad > 0)
>> > +                       max_sector = sector_nr + min_bad;
>> > +               rv = max_sector - sector_nr;
>> >                 *skipped = 1;
>> >                 put_buf(r1_bio);
>> >                 return rv;
>> >
>>
>> I tested it and it works. However, I have a question: shouldn't we try
>> to clear bad blocks during "repair" in particular or during any
>> kind-of-sync in general?
>
> Thanks for testing.
>
>
>>
>> Because currently the following is happening:
>> - In sync_request devices are selected as candidates for READs and for
>> WRITEs (using various criteria)
>> - Then we issue the READs and eventually end up in sync_request_write()
>> - Here we schedule WRITEs (without consulting bad blocks or WriteErrorSeen bit)
>> - In end_sync_write, we check if we can clear some bad blocks, and if
>> yes, eventually we end up in handle_sync_write_finished(), where we
>> clear bad blocks
>>
>> So getting back to sync_request(), the issue is this: if we consider a
>> device for READs, we check for badblocks. If we find a badblock, we
>> skip it and don't consider this device neither for READs nor for
>> WRITEs. How about:
>> - We consider a device for READs
>> - If it has a bad block, we consider it for WRITEs instead (so we have
>> a 3rd place where we consider the device for WRITE).
>> - We may consult WriteErrorSeen bit as we do in normal make_request(),
>> but not sure, because right now on sync path, we don't consult it
>
> Good point.
> I think we should consider WriteErrorSeen here.  We don't currently consult
> it in sync because we avoid bad blocks completely.
> So you patch would become:
>
>   } else if (!test_bit(WriteErrorSeen, &rdev->flags)) {
>          bio->bi_rw = WRITE;
>          bio->bi_end_io = end_sync_write;
>          write_targets++;
>   }
>
> If you could confirm that works and resubmit it with a 'Signed-off-by:' line
> I'll apply it.
> Thanks.
> I've already included my previous patch in my queue and tagged it for -stable.
>
> NeilBrown
>
>>
>> So this patch:
>> --- c:/work/code_backups/psp13/raid1.c  Sun Jul 15 16:39:07 2012
>> +++ ubuntu-precise/source/drivers/md/raid1.c    Mon Jul 16 11:35:40 2012
>> @@ -2385,20 +2385,25 @@
>>                                         if (wonly < 0)
>>                                                 wonly = i;
>>                                 } else {
>>                                         if (disk < 0)
>>                                                 disk = i;
>>                                 }
>>                                 bio->bi_rw = READ;
>>                                 bio->bi_end_io = end_sync_read;
>>                                 read_targets++;
>>                         }
>> +                       else {
>> +                               bio->bi_rw = WRITE;
>> +                               bio->bi_end_io = end_sync_write;
>> +                               write_targets ++;
>> +                       }
>>                 }
>>                 if (bio->bi_end_io) {
>>                         atomic_inc(&rdev->nr_pending);
>>                         bio->bi_sector = sector_nr + rdev->data_offset;
>>                         bio->bi_bdev = rdev->bdev;
>>                         bio->bi_private = r1_bio;
>>                 }
>>         }
>>         rcu_read_unlock();
>>         if (disk < 0)
>>
>>
>> I tested this patch (together with yours) and it cleared the bad
>> blocks, but not sure what else I might be missing. My Linux kernel
>> programming skills are still ... questionable.
>>
>> Thanks,
>> Alex.
>>
>>
>>
>> >
>> > Thanks,
>> > NeilBrown
>> >
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html