On Wed, 13 Feb 2013 15:30:30 +0100 Sebastian Riemer
<sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:

> On 13.02.2013 12:45, Sebastian Riemer wrote:
> > On 13.02.2013 03:38, NeilBrown wrote:
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 8b557d2..292cc2f 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -6529,7 +6529,17 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> >>  			mddev->ro = 0;
> >>  			sysfs_notify_dirent_safe(mddev->sysfs_state);
> >>  			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >> -			md_wakeup_thread(mddev->thread);
> >> +			/* mddev_unlock will wake thread */
> >> +			/* If a device failed while we were read-only, we
> >> +			 * need to make sure the metadata is updated now.
> >> +			 */
> >> +			if (test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
> >> +				mddev_unlock(mddev);
> >> +				wait_event(mddev->sb_wait,
> >> +					   !test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
> >> +					   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> >> +				mddev_lock(mddev);
> >> +			}
> >>  		} else {
> >>  			err = -EROFS;
> >>  			goto abort_unlock;
> >>
> >
> > Thanks, Neil!
> >
> > I can confirm the issue on 3.4.y and that your patch fixes it reliably.
> >
> > Acked-by: Sebastian Riemer <sebastian.riemer@xxxxxxxxxxxxxxxx>
> >
>
> Damn, I've got a kernel which still crashes in
> reap_sync_thread->raid1_spare_active() with NULL pointer dereference
> although this patch is applied. So the fix isn't correct, yet.
>
> I did some "objdump -S" on raid1.ko and found the issue at the following
> code location in raid1_spare_active():
> #	for (i = 0; i < conf->raid_disks; i++) {
> #		struct md_rdev *rdev = conf->mirrors[i].rdev;
> #		struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
>
> A resync was pending (create without --assume-clean).
> For me it looks like the faulty setting races with the syncer. The rdev
> isn't registered in the personality anymore but the syncer tries to
> access it for immediate resync.
>

Where exactly is it crashing? Can I see the complete Oops message?

The code you have identified cannot crash unless conf->raid_disks has
become inconsistent with the allocation of ->mirrors, and that is very
unlikely. Both 'rdev' and 'repl' are tested for NULL before they are
used...

If you can get me the Oops message I can probably narrow it down.

Thanks,
NeilBrown