----- Original Message -----
> From: "NeilBrown" <neilb@xxxxxxxx>
> To: "Xiao Ni" <xni@xxxxxxxxxx>
> Cc: "linux-raid" <linux-raid@xxxxxxxxxxxxxxx>, shli@xxxxxxxxxx
> Sent: Monday, September 11, 2017 11:36:25 AM
> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared
>
> On Sun, Sep 10 2017, Xiao Ni wrote:
>
> > ----- Original Message -----
> >> From: "NeilBrown" <neilb@xxxxxxxx>
> >> To: "Xiao Ni" <xni@xxxxxxxxxx>, "linux-raid" <linux-raid@xxxxxxxxxxxxxxx>
> >> Cc: shli@xxxxxxxxxx
> >> Sent: Thursday, September 7, 2017 1:37:45 PM
> >> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared
> >>
> >> On Wed, Sep 06 2017, Xiao Ni wrote:
> >>
> >> > ----- Original Message -----
> >> >> From: "Xiao Ni" <xni@xxxxxxxxxx>
> >> >> To: "NeilBrown" <neilb@xxxxxxxx>, "linux-raid" <linux-raid@xxxxxxxxxxxxxxx>
> >> >> Cc: shli@xxxxxxxxxx
> >> >> Sent: Tuesday, September 5, 2017 10:15:00 AM
> >> >> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared
> >> >>
> >> >> On 09/05/2017 09:36 AM, NeilBrown wrote:
> >> >> > On Mon, Sep 04 2017, Xiao Ni wrote:
> >> >> >
> >> >> >> In function handle_stripe:
> >> >> >> 4697         if (s.handle_bad_blocks ||
> >> >> >> 4698             test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
> >> >> >> 4699                 set_bit(STRIPE_HANDLE, &sh->state);
> >> >> >> 4700                 goto finish;
> >> >> >> 4701         }
> >> >> >>
> >> >> >> Because MD_SB_CHANGE_PENDING is set, the stripes can't be handled.
> >> >> >>
> >> >> > Right, of course.  I see what is happening now.
> >> >> >
> >> >> >  - raid5d cannot complete stripes until the metadata is written
> >> >> >  - the metadata cannot be written until raid5d gets the mddev_lock
> >> >> >  - mddev_lock is held by the write to suspend_hi
> >> >> >  - the write to suspend_hi is waiting for raid5_quiesce
> >> >> >  - raid5_quiesce is waiting for some stripes to complete.
> >> >> >
> >> >> > We could declare that ->quiesce(, 1) cannot be called while holding the
> >> >> > lock.
> >> >> > We could possibly allow it, but only if md_update_sb() is called first,
> >> >> > though that might still be racy.
> >> >> >
> >> >> > ->quiesce(, 1) is currently called from:
> >> >> >  mddev_suspend
> >> >> >  suspend_lo_store
> >> >> >  suspend_hi_store
> >> >> >  __md_stop_writes
> >> >> >  mddev_detach
> >> >> >  set_bitmap_file
> >> >> >  update_array_info (when setting/removing internal bitmap)
> >> >> >  md_do_sync
> >> >> >
> >> >> > and most of those are called with the lock held, or take the lock.
> >> >> >
> >> >> > Maybe we should *require* that mddev_lock is held when calling
> >> >> > ->quiesce() and have ->quiesce() do the metadata update.
> >> >> >
> >> >> > Something like the following maybe.  Can you test it?
> >> >>
> >> >> Hi Neil
> >> >>
> >> >> Thanks for the analysis. I need to think about it for a while :)
> >> >> I have already applied the patch and the test is running now. It usually
> >> >> takes more than 5 hours to reproduce this problem, so I'll let it run for
> >> >> more than 24 hours. I'll update the test result later.
> >> >
> >> > Hi Neil
> >> >
> >> > The problem still exists, but it doesn't show a calltrace this time. It
> >> > was already stuck yesterday; I didn't notice because there was no calltrace.
> >> >
> >> > echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
> >> >
> >> > It shows that raid5d is still spinning.
> >>
> >> Thanks for testing.
> >> I've thought some more and I think there is a better approach.
> >> The fact that we need to take the mutex to write the super block has
> >> caused problems several times before and is a key part of the problem now.
> >> Maybe we should relax that. Obviously we don't want two threads updating
> >> the metadata at the same time, but it should be safe to update it in
> >> parallel with other uses of reconfig_mutex.
> >>
> >> Holding mddev->lock while copying data from the struct mddev to the
> >> superblock (which we do) should ensure that the superblock is internally
> >> consistent, and that should be enough.
> >>
> >> So I propose the following patch. It certainly needs review and testing,
> >> but I think it should make a big improvement.
> >>
> >> Thanks,
> >> NeilBrown
> >>
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index b01e458d31e9..414a4c1a052d 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -2388,6 +2388,15 @@ void md_update_sb(struct mddev *mddev, int force_change)
> >>  		return;
> >>  	}
> >>
> >> +	if (!force_change && !(mddev->sb_flags & ~BIT(MD_SB_UPDATE_ACTIVE)))
> >> +		return;
> >> +
> >> +	wait_event(mddev->sb_wait,
> >> +		   !test_and_set_bit(MD_SB_UPDATE_ACTIVE, &mddev->sb_flags));
> >> +
> >> +	if (!force_change && !(mddev->sb_flags & ~BIT(MD_SB_UPDATE_ACTIVE)))
> >> +		goto out;
> >> +
> >>  repeat:
> >>  	if (mddev_is_clustered(mddev)) {
> >>  		if (test_and_clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags))
> >> @@ -2402,7 +2411,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
> >>  			bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
> >>  							 BIT(MD_SB_CHANGE_DEVS) |
> >>  							 BIT(MD_SB_CHANGE_CLEAN));
> >> -			return;
> >> +			goto out;
> >>  		}
> >>  	}
> >>
> >> @@ -2432,8 +2441,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
> >>  				wake_up(&rdev->blocked_wait);
> >>  			}
> >>  		}
> >> -		wake_up(&mddev->sb_wait);
> >> -		return;
> >> +		goto out;
> >>  	}
> >>
> >>  	spin_lock(&mddev->lock);
> >> @@ -2544,6 +2552,9 @@ void md_update_sb(struct mddev *mddev, int force_change)
> >>  			       BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_CLEAN)))
> >>  		/* have to write it out again */
> >>  		goto repeat;
> >> +
> >> +out:
> >> +	clear_bit_unlock(MD_SB_UPDATE_ACTIVE, &mddev->sb_flags);
> >>  	wake_up(&mddev->sb_wait);
> >>  	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> >>  		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
> >> @@ -5606,8 +5617,7 @@ int md_run(struct mddev *mddev)
> >>  		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >>  	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >>
> >> -	if (mddev->sb_flags)
> >> -		md_update_sb(mddev, 0);
> >> +	md_update_sb(mddev, 0);
> >>
> >>  	md_new_event(mddev);
> >>  	sysfs_notify_dirent_safe(mddev->sysfs_state);
> >> @@ -8643,17 +8653,14 @@ void md_check_recovery(struct mddev *mddev)
> >>
> >>  	if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
> >>  		return;
> >> -	if ( ! (
> >> -		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
> >> +	if ((
> >>  		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> >>  		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> >>  		(mddev->external == 0 && mddev->safemode == 1) ||
> >>  		(mddev->safemode == 2
> >>  		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
> >> -		))
> >> -		return;
> >> -
> >> -	if (mddev_trylock(mddev)) {
> >> +		) &&
> >> +	    mddev_trylock(mddev)) {
> >>  		int spares = 0;
> >>
> >>  		if (!mddev->external && mddev->safemode == 1)
> >> @@ -8706,9 +8713,6 @@ void md_check_recovery(struct mddev *mddev)
> >>  			spin_unlock(&mddev->lock);
> >>  		}
> >>
> >> -		if (mddev->sb_flags)
> >> -			md_update_sb(mddev, 0);
> >> -
> >>  		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> >>  		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> >>  			/* resync/recovery still happening */
> >> @@ -8786,6 +8790,7 @@ void md_check_recovery(struct mddev *mddev)
> >>  			wake_up(&mddev->sb_wait);
> >>  		mddev_unlock(mddev);
> >>  	}
> >> +	md_update_sb(mddev, 0);
> >>  }
> >>  EXPORT_SYMBOL(md_check_recovery);
> >>
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index 09db03455801..bc8633cf7c40 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -243,6 +243,7 @@ enum mddev_sb_flags {
> >>  	MD_SB_CHANGE_CLEAN,	/* transition to or from 'clean' */
> >>  	MD_SB_CHANGE_PENDING,	/* switch from 'clean' to 'active' in progress */
> >>  	MD_SB_NEED_REWRITE,	/* metadata write needs to be repeated */
> >> +	MD_SB_UPDATE_ACTIVE,	/* A thread is running in md_update_sb */
> >>  };
> >>
> >>  struct mddev {
> >> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> >> index 2dcbafa8e66c..76169dd8ff7c 100644
> >> --- a/drivers/md/raid5-cache.c
> >> +++ b/drivers/md/raid5-cache.c
> >> @@ -1334,21 +1334,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
> >>  	mddev = log->rdev->mddev;
> >>  	/*
> >>  	 * Discard could zero data, so before discard we must make sure
> >> -	 * superblock is updated to new log tail. Updating superblock (either
> >> -	 * directly call md_update_sb() or depend on md thread) must hold
> >> -	 * reconfig mutex. On the other hand, raid5_quiesce is called with
> >> -	 * reconfig_mutex hold. The first step of raid5_quiesce() is waitting
> >> -	 * for all IO finish, hence waitting for reclaim thread, while reclaim
> >> -	 * thread is calling this function and waitting for reconfig mutex. So
> >> -	 * there is a deadlock. We workaround this issue with a trylock.
> >> -	 * FIXME: we could miss discard if we can't take reconfig mutex
> >> +	 * superblock is updated to new log tail.
> >>  	 */
> >> -	set_mask_bits(&mddev->sb_flags, 0,
> >> -		BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> >> -	if (!mddev_trylock(mddev))
> >> -		return;
> >> +
> >>  	md_update_sb(mddev, 1);
> >> -	mddev_unlock(mddev);
> >>
> >>  	/* discard IO error really doesn't matter, ignore it */
> >>  	if (log->last_checkpoint < end) {
> >>
> >
> > Hi Neil
> >
> > The test has run for three days and the problem is fixed by this patch.
> > Thanks for the help.
>
> Thanks for testing. I'll look over the patch again and see if there is
> any chance that the locking change could introduce other problems.

Ok, if you have new patches I can do the test.

>
> >
> > Could you help to look at https://www.spinics.net/lists/raid/msg58918.html.
> > The bug which is fixed by your patch was found when I tried to reproduce
> > that bug. I did a simple analysis, but I'm not sure whether I'm right or not.
>
> It might be the same bug, but if it is there should be other processes
> stuck in a D wait, one of them holding the reconfig_mutex and waiting
> for the array to quiesce.

Hmm, but raid5d is stuck in this problem too: it is stuck at md_check_recovery,
and conf->quiesce is 0 in this case.

>
> Were there any other processes in D wait?

Yes, there is another process in D state:

[440793.893432] INFO: task jbd2/md0-8:27647 blocked for more than 120 seconds.
[440793.900387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[440793.908299] jbd2/md0-8      D ffff8801fe815e00     0 27647      2 0x00000080
[440793.908300]  ffff88021ae9fc90 0000000000000046 ffff88022084edd0 ffff88021ae9ffd8
[440793.915846]  ffff88021ae9ffd8 ffff88021ae9ffd8 ffff88022084edd0 ffff88021ae9fda0
[440793.923390]  ffff88005399e0c0 ffff88022084edd0 ffff88021ae9fd88 ffff8801fe815e00
[440793.930931] Call Trace:
[440793.933467]  [<ffffffff81689d39>] schedule+0x29/0x70
[440793.938522]  [<ffffffffa0a15238>] jbd2_journal_commit_transaction+0x248/0x19a0 [jbd2]
[440793.946434]  [<ffffffff810c780e>] ? account_entity_dequeue+0xae/0xd0
[440793.952876]  [<ffffffff810cb2fc>] ? dequeue_entity+0x11c/0x5d0
[440793.958797]  [<ffffffff81028569>] ? __switch_to+0xd9/0x4c0
[440793.964372]  [<ffffffff810af390>] ? wake_up_atomic_t+0x30/0x30
[440793.970293]  [<ffffffff810963ce>] ? try_to_del_timer_sync+0x5e/0x90
[440793.976645]  [<ffffffffa0a1b849>] kjournald2+0xc9/0x260 [jbd2]
[440793.982565]  [<ffffffff810af390>] ? wake_up_atomic_t+0x30/0x30
[440793.988485]  [<ffffffffa0a1b780>] ? commit_timeout+0x10/0x10 [jbd2]
[440793.994839]  [<ffffffff810ae2bf>] kthread+0xcf/0xe0
[440793.999803]  [<ffffffff810ae1f0>] ? kthread_create_on_node+0x140/0x140
[440794.006419]  [<ffffffff81695598>] ret_from_fork+0x58/0x90
[440794.011906]  [<ffffffff810ae1f0>] ? kthread_create_on_node+0x140/0x140

Regards
Xiao
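For reference, the serialization the patch above adds to md_update_sb() boils
down to: sleep on mddev->sb_wait until MD_SB_UPDATE_ACTIVE can be claimed with
test_and_set_bit(), re-check whether an update is still needed, write the
superblock without holding reconfig_mutex, then release the bit and wake the
other waiters. Below is a minimal user-space sketch of that same pattern; the
pthread/stdatomic framing and every name in it are illustrative stand-ins, not
the kernel code.

/*
 * Illustrative user-space sketch (NOT kernel code) of the serialization
 * pattern the patch adds to md_update_sb():
 *   - sleep on a waitqueue until an UPDATE_ACTIVE bit can be claimed
 *     atomically (at most one updater at a time),
 *   - re-check whether an update is still needed (another thread may have
 *     written the superblock while we slept),
 *   - do the write, then release the bit and wake the remaining waiters.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define SB_CHANGE_PENDING (1u << 0)	/* stand-in for the MD_SB_CHANGE_* bits */
#define SB_UPDATE_ACTIVE  (1u << 1)	/* stand-in for MD_SB_UPDATE_ACTIVE */

static atomic_uint sb_flags;
static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t sb_wait = PTHREAD_COND_INITIALIZER;	/* ~ mddev->sb_wait */

/* ~ !test_and_set_bit(MD_SB_UPDATE_ACTIVE, &mddev->sb_flags) */
static int claim_update(void)
{
	return !(atomic_fetch_or(&sb_flags, SB_UPDATE_ACTIVE) & SB_UPDATE_ACTIVE);
}

static void update_sb(int force_change)
{
	/* nothing to write and not forced: same early return as the patch */
	if (!force_change && !(atomic_load(&sb_flags) & ~SB_UPDATE_ACTIVE))
		return;

	/* ~ wait_event(mddev->sb_wait, !test_and_set_bit(...)) */
	pthread_mutex_lock(&wq_lock);
	while (!claim_update())
		pthread_cond_wait(&sb_wait, &wq_lock);
	pthread_mutex_unlock(&wq_lock);

	/* another updater may have written everything out while we slept */
	if (force_change || (atomic_load(&sb_flags) & ~SB_UPDATE_ACTIVE)) {
		atomic_fetch_and(&sb_flags, ~SB_CHANGE_PENDING);
		printf("writing superblock (no reconfig_mutex needed here)\n");
	}

	/* ~ clear_bit_unlock(MD_SB_UPDATE_ACTIVE, ...); wake_up(&mddev->sb_wait); */
	atomic_fetch_and(&sb_flags, ~SB_UPDATE_ACTIVE);
	pthread_mutex_lock(&wq_lock);
	pthread_cond_broadcast(&sb_wait);
	pthread_mutex_unlock(&wq_lock);
}

int main(void)
{
	atomic_fetch_or(&sb_flags, SB_CHANGE_PENDING);
	update_sb(0);
	return 0;
}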