Re: Set disk faulty / hot disk remove ioctl bug for read-only MD?

NeilBrown <neilb@xxxxxxx> · Thu, 14 Feb 2013 08:55:21 +1100

On Wed, 13 Feb 2013 15:30:30 +0100 Sebastian Riemer
<sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:

> On 13.02.2013 12:45, Sebastian Riemer wrote:
> > On 13.02.2013 03:38, NeilBrown wrote:
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 8b557d2..292cc2f 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -6529,7 +6529,17 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> >>  			mddev->ro = 0;
> >>  			sysfs_notify_dirent_safe(mddev->sysfs_state);
> >>  			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >> -			md_wakeup_thread(mddev->thread);
> >> +			/* mddev_unlock will wake thread */
> >> +			/* If a device failed while we were read-only, we
> >> +			 * need to make sure the metadata is updated now.
> >> +			 */
> >> +			if (test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
> >> +				mddev_unlock(mddev);
> >> +				wait_event(mddev->sb_wait,
> >> +					   !test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
> >> +					   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> >> +				mddev_lock(mddev);
> >> +			}
> >>  		} else {
> >>  			err = -EROFS;
> >>  			goto abort_unlock;
> >>
> > 
> > Thanks, Neil!
> > 
> > I can confirm the issue on 3.4.y and that your patch fixes it reliably.
> > 
> > Acked-by: Sebastian Riemer <sebastian.riemer@xxxxxxxxxxxxxxxx>
> > 
> 
> Damn, I've got a kernel which still crashes in
> reap_sync_thread->raid1_spare_active() with a NULL pointer dereference
> even though this patch is applied. So the fix isn't correct yet.
> 
> I did some "objdump -S" on raid1.ko and found the issue at the following
> code location in raid1_spare_active():
> #	for (i = 0; i < conf->raid_disks; i++) {
> #		struct md_rdev *rdev = conf->mirrors[i].rdev;
> #		struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
> 
> A resync was pending (the array was created without --assume-clean).
> To me it looks like setting the disk faulty races with the syncer. The
> rdev isn't registered in the personality anymore, but the syncer still
> tries to access it for an immediate resync.
> 

Where exactly is it crashing?  Can I see the complete Oops message?
The code you have identified cannot crash unless conf->raid_disks has become
inconsistent with the allocation of ->mirrors, and that is very unlikely.
Both 'rdev' and 'repl' are tested for NULL before they are used...
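
As a rough user-space illustration of that argument (a simplified sketch,
not the actual raid1.c code; the cut-down structs and the helper name
count_activated() are assumptions made for this example): loading the two
pointers is harmless as long as ->mirrors really has 2 * raid_disks slots,
and nothing is dereferenced until the NULL test has passed.

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures (hypothetical, user space only). */
struct md_rdev {
	int in_sync;
	int faulty;
};

struct mirror_info {
	struct md_rdev *rdev;
};

struct r1conf {
	int raid_disks;
	struct mirror_info *mirrors;	/* assumed to hold 2 * raid_disks entries */
};

/* Walk the mirrors like the quoted loop and count devices newly marked in-sync. */
static int count_activated(struct r1conf *conf)
{
	int count = 0;

	for (int i = 0; i < conf->raid_disks; i++) {
		/* These loads only read array slots; they cannot fault unless
		 * raid_disks disagrees with the size of the mirrors array. */
		struct md_rdev *rdev = conf->mirrors[i].rdev;
		struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;

		/* Each pointer is tested for NULL before it is dereferenced. */
		if (repl && !repl->faulty && !repl->in_sync) {
			repl->in_sync = 1;
			count++;
		}
		if (rdev && !rdev->faulty && !rdev->in_sync) {
			rdev->in_sync = 1;
			count++;
		}
	}
	return count;
}
```

In this model an empty slot (a NULL rdev or repl) is simply skipped, so a
crash would need raid_disks to be out of step with the mirrors allocation.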

If you can get me the Oops message I can probably narrow it down.

Thanks,
NeilBrown
