On Mon, Sep 04 2017, Xiao Ni wrote:
>
> In function handle_stripe:
>
> 4697			if (s.handle_bad_blocks ||
> 4698			    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
> 4699				set_bit(STRIPE_HANDLE, &sh->state);
> 4700				goto finish;
> 4701			}
>
> Because MD_SB_CHANGE_PENDING is set, the stripes can't be handled.

Right, of course.  I see what is happening now.

 - raid5d cannot complete stripes until the metadata is written
 - the metadata cannot be written until raid5d gets the mddev_lock
 - mddev_lock is held by the write to suspend_hi
 - the write to suspend_hi is waiting for raid5_quiesce
 - raid5_quiesce is waiting for some stripes to complete.

We could declare that ->quiesce(, 1) cannot be called while holding the
lock.  We could possibly allow it, but only if md_update_sb() is called
first, though that might still be racy.

->quiesce(, 1) is currently called from:
  mddev_suspend
  suspend_lo_store
  suspend_hi_store
  __md_stop_writes
  mddev_detach
  set_bitmap_file
  update_array_info (when setting/removing internal bitmap)
  md_do_sync

and most of those are called with the lock held, or take the lock.

Maybe we should *require* that mddev_lock is held when calling
->quiesce() and have ->quiesce() do the metadata update.

Something like the following, maybe.  Can you test it?

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b01e458d31e9..999ccf08c5db 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5805,9 +5805,11 @@ void md_stop(struct mddev *mddev)
 	/* stop the array and free an attached data structures.
 	 * This is called from dm-raid
 	 */
+	mddev_lock_nointr(mddev);
 	__md_stop(mddev);
 	if (mddev->bio_set)
 		bioset_free(mddev->bio_set);
+	mddev_unlock(mddev);
 }
 EXPORT_SYMBOL_GPL(md_stop);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0fc2748aaf95..cde5a82eb404 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4316,6 +4316,8 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
 			/* place all the copies on one channel */
 			init_async_submit(&submit, 0, tx, NULL, NULL, NULL);
+			WARN_ON(sh2->dev[dd_idx].page != sh2->dev[dd_idx].orig_page);
+			WARN_ON(sh->dev[i].page != sh->dev[i].orig_page);
 			tx = async_memcpy(sh2->dev[dd_idx].page, sh->dev[i].page,
 					  0, 0, STRIPE_SIZE, &submit);
@@ -8031,7 +8033,10 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 		wait_event_cmd(conf->wait_for_quiescent,
 			       atomic_read(&conf->active_stripes) == 0 &&
 			       atomic_read(&conf->active_aligned_reads) == 0,
-			       unlock_all_device_hash_locks_irq(conf),
+			       ({unlock_all_device_hash_locks_irq(conf);
+				 if (mddev->sb_flags)
+					 md_update_sb(mddev, 0);
+				}),
 			       lock_all_device_hash_locks_irq(conf));
 		conf->quiesce = 1;
 		unlock_all_device_hash_locks_irq(conf);