On Thu, Feb 25 2016, Shaohua Li wrote:
>
> As for the bug, write requests run in raid5d, mddev_suspend() waits for all IO,
> which waits for the write requests. So this is a clear deadlock. I think we
> should delete the check_reshape() in md_check_recovery(). If we change
> layout/disks/chunk_size, check_reshape() is already called. If we start an
> array, the .run() already handles new layout. There is no point
> md_check_recovery() check_reshape() again.

Are you sure? Did you look at the commit which added that code?
  commit b4c4c7b8095298ff4ce20b40bf180ada070812d0

When there is an IO error, reshape (or resync or recovery) will abort and
then possibly be automatically restarted.

Without the check here a reshape might be attempted on an array which has
failed. Not sure if that would be harmful, but it would certainly be
pointless.

But you are right that this is causing the problem.
Maybe we should keep track of the size of the 'scribble' arrays and only
call resize_chunks if the size needs to change? Similar to what
resize_stripes does.

It might also be good to put something like
   WARN_ON(current == mddev->thread->task);
in mddev_suspend() ... or whatever code would cause this sort of error to
trigger a warning early.

Thanks,
NeilBrown

>
> Artur, can you check if below works for you?
>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 464627b..7fb1103 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8408,8 +8408,7 @@ void md_check_recovery(struct mddev *mddev)
>  		 */
>  
>  		if (mddev->reshape_position != MaxSector) {
> -			if (mddev->pers->check_reshape == NULL ||
> -			    mddev->pers->check_reshape(mddev) != 0)
> +			if (mddev->pers->check_reshape == NULL)
>  				/* Cannot proceed */
>  				goto not_running;
>  			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
>
> Thanks,
> Shaohua
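The two suggestions in the reply (only resize the 'scribble' arrays when their size actually changes, and warn if mddev_suspend() is reached from the array's own thread) can be illustrated with a rough userspace model. This is a sketch under stated assumptions, not kernel code and not a proposed patch: every name in it (struct array_model, suspend_array(), resize_scribble_if_needed(), worker_task) is hypothetical, integer task ids stand in for current and mddev->thread->task, and realloc() stands in for whatever resize_chunks() really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are real md/raid5 APIs. */
struct array_model {
	int worker_task;      /* id of the task playing the role of raid5d */
	size_t scribble_len;  /* scratch size currently allocated */
	void *scribble;       /* the scratch buffer itself */
	int suspend_count;    /* how many times the array had to be quiesced */
};

/*
 * Stand-in for mddev_suspend(): it would wait for all in-flight IO.
 * If the caller is the worker that completes that IO, the wait can
 * never finish, so warn and refuse instead of hanging, in the spirit
 * of WARN_ON(current == mddev->thread->task).
 */
static bool suspend_array(struct array_model *a, int current_task)
{
	if (current_task == a->worker_task) {
		fprintf(stderr, "WARNING: suspend requested from worker thread\n");
		return false;
	}
	a->suspend_count++;
	return true;
}

static void resume_array(struct array_model *a)
{
	(void)a;  /* nothing to undo in this model */
}

/*
 * Reallocate (and therefore suspend) only when the size changes.
 * Returns 0 on success or when nothing needed doing, -1 on failure.
 * Callers are assumed to pass new_len > 0.
 */
static int resize_scribble_if_needed(struct array_model *a, size_t new_len,
				     int current_task)
{
	void *buf;

	if (new_len == a->scribble_len)
		return 0;  /* same size: no suspend, so no self-deadlock */

	if (!suspend_array(a, current_task))
		return -1;

	buf = realloc(a->scribble, new_len);
	if (!buf) {
		resume_array(a);
		return -1;  /* old buffer and old size are left untouched */
	}
	a->scribble = buf;
	a->scribble_len = new_len;
	resume_array(a);
	return 0;
}
```

In this model a repeated call with an unchanged size returns immediately without suspending, even when made from the worker thread, while a call from the worker thread that really needs a new size is refused with a warning rather than deadlocking. Whether the real raid5 code can cheaply compare the old and new scribble sizes is an assumption here, not something the thread establishes.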