Re: [PATCH 2/9] raid5-cache: add trim support for log

Shaohua Li <shli@xxxxxx> · Mon, 12 Oct 2015 09:57:20 -0700

On Mon, Oct 12, 2015 at 05:00:14PM +1100, Neil Brown wrote:
> Shaohua Li <shli@xxxxxx> writes:
> 
> > Since superblock is updated infrequently, we do a simple trim of log
> > disk (a synchronous trim)
> >
> > Signed-off-by: Shaohua Li <shli@xxxxxx>
> > ---
> >  drivers/md/raid5-cache.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 62 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> > index caeec10..5dbe084 100644
> > --- a/drivers/md/raid5-cache.c
> > +++ b/drivers/md/raid5-cache.c
> > @@ -85,6 +85,7 @@ struct r5l_log {
> >  	spinlock_t no_space_stripes_lock;
> >  
> >  	bool need_cache_flush;
> > +	bool in_teardown;
> >  };
> >  
> >  /*
> > @@ -644,6 +645,60 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
> >  }
> >  
> >  static void r5l_write_super(struct r5l_log *log, sector_t cp);
> > +static void r5l_write_super_and_discard_space(struct r5l_log *log,
> > +	sector_t end)
> > +{
> > +	struct block_device *bdev = log->rdev->bdev;
> > +	struct mddev *mddev;
> > +
> > +	r5l_write_super(log, end);
> > +
> > +	if (!blk_queue_discard(bdev_get_queue(bdev)))
> > +		return;
> > +
> > +	mddev = log->rdev->mddev;
> > +	/*
> > +	 * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
> > +	 * wait for this thread to finish. This thread waits for
> > +	 * MD_CHANGE_PENDING clear, which is supposed to be done in
> > +	 * md_check_recovery(). md_check_recovery() tries to get
> > +	 * reconfig_mutex. Since r5l_quiesce already holds the mutex,
> > +	 * md_check_recovery() fails, so the PENDING never get cleared. The
> > +	 * in_teardown check workaround this issue.
> > +	 * */
> 
> Thanks for this comment (but can we just end the comment with "*/"
> instead of  "* */" - that's what everyone else does).

sorry, I will set vim correctly.
> It helps a lot.  If feels a bit clumsy, but maybe that is just me.  At
> least I understand now and it make sense. 
> > +	if (!log->in_teardown) {
> > +		set_bit(MD_CHANGE_DEVS, &mddev->flags);
> > +		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> > +		md_wakeup_thread(mddev->thread);
> > +		wait_event(mddev->sb_wait,
> > +			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
> > +			log->in_teardown);
> > +		/*
> > +		 * r5l_quiesce could run after in_teardown check and hold
> > +		 * mutex first. Superblock might get updated twice.
> > +		 * */
> > +		if (log->in_teardown)
> > +			md_update_sb(mddev, 1);
> > +	} else {
> > +		BUG_ON(!mddev_is_locked(mddev));
> > +		md_update_sb(mddev, 1);
> > +	}
> > +
> > +	if (log->last_checkpoint < end) {
> > +		blkdev_issue_discard(bdev,
> > +				log->last_checkpoint + log->rdev->data_offset,
> > +				end - log->last_checkpoint, GFP_NOIO, 0);
> > +	} else {
> > +		blkdev_issue_discard(bdev,
> > +				log->last_checkpoint + log->rdev->data_offset,
> > +				log->device_size - log->last_checkpoint,
> > +				GFP_NOIO, 0);
> > +		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
> > +				GFP_NOIO, 0);
> > +	}
> > +}
> > +
> > +
> >  static void r5l_do_reclaim(struct r5l_log *log)
> >  {
> >  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
> > @@ -685,7 +740,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
> >  	 * here, because the log area might be reused soon and we don't want to
> >  	 * confuse recovery
> >  	 */
> > -	r5l_write_super(log, next_checkpoint);
> > +	r5l_write_super_and_discard_space(log, next_checkpoint);
> >  
> >  	mutex_lock(&log->io_mutex);
> >  	log->last_checkpoint = next_checkpoint;
> > @@ -721,9 +776,11 @@ static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
> >  
> >  void r5l_quiesce(struct r5l_log *log, int state)
> >  {
> > +	struct mddev *mddev;
> >  	if (!log || state == 2)
> >  		return;
> >  	if (state == 0) {
> > +		log->in_teardown = 0;
> >  		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
> >  					log->rdev->mddev, "reclaim");
> >  	} else if (state == 1) {
> > @@ -731,6 +788,10 @@ void r5l_quiesce(struct r5l_log *log, int state)
> >  		 * at this point all stripes are finished, so io_unit is at
> >  		 * least in STRIPE_END state
> >  		 */
> > +		log->in_teardown = 1;
> > +		/* make sure r5l_write_super_and_discard_space exits */
> > +		mddev = log->rdev->mddev;
> > +		wake_up(&mddev->sb_wait);
> >  		r5l_wake_reclaim(log, -1L);
> >  		md_unregister_thread(&log->reclaim_thread);
> 
> I think this is racey (though you haven't changed the code, so it must
> have been racy before).
> There is no guarantee that the thread will actually wake up to handle
> the reclaim before md_unregister_thread marks the thread as
> kthread_should_stop().
> Maybe the thing to do would be
>    md_unregister_thread()
>    log->reclaim_target = (unsigned long) -1;
>    r5l_do_reclaim(log);
> 
> or something like that.  i.e. just do it synchronously.

This doesn't sound like a race. If the thread doesn't run because it
exits after kthread_should_stop check, the reclaim_target should remain
to be -1 after md_unregister_thread

> Then.... maybe you don't need in_teardown.
> When r5l_do_reclaim() is called from the thread, you wait.
> When called from r5l_quiesce(), you md_update_sb().
> 
> Would that work?

It wouldn't. This only fixes the deadlock issue r5l_quiesce calls
r5l_do_reclaim() directly. The problem is r5l_wake_reclaim() will wake
the thread, md_unregister_thread will wait till the thread exits. At
that time the thread might try to lock the reconfig_mutex. Since
r5l_quesce already holds the mutex, the md_unregister_thread never
returns. In summary, r5l_do_reclaim() is called from the thread can't be
used to determine whether we need the deadlock avoidness.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html