On Sat, 18 Nov 2017, Mike Galbraith wrote:

> On Fri, 2017-11-17 at 15:57 +0100, Sebastian Siewior wrote:
> > On 2017-11-13 12:56:53 [-0500], Mikulas Patocka wrote:
> > > Hi
> > 
> > Hi,
> > 
> > > I'm submitting this patch for the CONFIG_PREEMPT_RT patch. It fixes
> > > deadlocks in device mapper when real time preemption is used.
> > 
> > applied, thank you.
> 
> Below is my 2012 3.0-rt version of that for reference; at that time we
> were using slab, and slab_lock referenced below was a local_lock. The
> comment came from crash analysis of a deadlock I met before adding the
> (yeah, hacky) __migrate_disabled() blocker. At the time, that was not
> an optional hack, you WOULD deadlock if you ground disks to fine powder
> the way SUSE QA did. Pulling the plug before blocking cured the
> xfs/ext[34] IO deadlocks they griped about, but I had to add that hack
> to not trade their nasty IO deadlocks for the more mundane variety. So
> my question is: are we sure that in the here and now flush won't want
> any lock we may be holding? In days of yore, it most definitely would
> turn your box into a doorstop if you tried hard enough.

I think that you shouldn't call blk_schedule_flush_plug when attempting
to take an rt-spinlock at all. An rt-mutex can't depend on an rt-spinlock
(although they use the same internal implementation), so a task waiting
on an rt-spinlock can't block any other task attempting to take an
rt-mutex from making progress.

Is there some specific scenario where you need to call
blk_schedule_flush_plug from rt_spin_lock_fastlock?

Mikulas

> Subject: rt: pull your plug before blocking
> Date: Wed Jul 18 14:43:15 CEST 2012
> From: Mike Galbraith <efault@xxxxxx>
> 
> Queued IO can lead to IO deadlock should a task require wakeup from a task
> which is blocked on that queued IO.
> 
> ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
> pulled. dbench2 mutex owner is waiting for kjournald, who is waiting for
> the buffer queued by dbench1. Game over.
> 
> Signed-off-by: Mike Galbraith <efault@xxxxxx>
> ---
>  kernel/locking/rtmutex.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -22,6 +22,7 @@
>  #include <linux/sched/deadline.h>
>  #include <linux/timer.h>
>  #include <linux/ww_mutex.h>
> +#include <linux/blkdev.h>
>  
>  #include "rtmutex_common.h"
>  
> @@ -930,8 +931,18 @@ static inline void rt_spin_lock_fastlock
>  
>  	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
>  		rt_mutex_deadlock_account_lock(lock, current);
> -	else
> +	else {
> +		/*
> +		 * We can't pull the plug if we're already holding a lock
> +		 * else we can deadlock. eg, if we're holding slab_lock,
> +		 * ksoftirqd can block while processing BLOCK_SOFTIRQ after
> +		 * having acquired q->queue_lock. If _we_ then block on
> +		 * that q->queue_lock while flushing our plug, deadlock.
> +		 */
> +		if (__migrate_disabled(current) < 2 && blk_needs_flush_plug(current))
> +			blk_schedule_flush_plug(current);
>  		slowfn(lock);
> +	}
>  }
>  
>  static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
> @@ -1892,9 +1903,12 @@ rt_mutex_fastlock(struct rt_mutex *lock,
>  	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
>  		rt_mutex_deadlock_account_lock(lock, current);
>  		return 0;
> -	} else
> +	} else {
> +		if (blk_needs_flush_plug(current))
> +			blk_schedule_flush_plug(current);
>  		return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK,
>  			      ww_ctx);
> +	}
>  }
>  
>  static inline int
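
For illustration only, here is a small userspace analogy (plain pthreads, not kernel code) of the ext3 scenario in the quoted changelog: dbench1 queues a buffer in a private per-thread plug, then blocks on the journal mutex held by dbench2; dbench2 waits for kjournald; kjournald waits for the buffer still sitting in dbench1's unflushed plug. The thread names and the flush_plug() helper below are made up to mirror that description, not taken from the patch.

/*
 * Hypothetical sketch of the "pull your plug before blocking" idea.
 * Build with: cc -pthread plug-demo.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Per-thread "plug": IO queued locally, invisible to the worker until flushed. */
static __thread bool plugged_buffer;

static pthread_mutex_t journal_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t state_cond = PTHREAD_COND_INITIALIZER;
static bool buffer_submitted;	/* set when a plug is flushed	  */
static bool journal_committed;	/* set by the worker ("kjournald") */

static void flush_plug(void)
{
	pthread_mutex_lock(&state_lock);
	if (plugged_buffer) {
		plugged_buffer = false;
		buffer_submitted = true;	/* the worker can now see the buffer */
		pthread_cond_broadcast(&state_cond);
	}
	pthread_mutex_unlock(&state_lock);
}

/* "dbench1": queues a buffer in its plug, then blocks on the journal mutex. */
static void *dbench1(void *arg)
{
	bool pull_plug_before_blocking = *(bool *)arg;

	plugged_buffer = true;
	if (pull_plug_before_blocking)
		flush_plug();		/* submit queued IO before blocking */
	pthread_mutex_lock(&journal_mutex);
	pthread_mutex_unlock(&journal_mutex);
	return NULL;
}

/* "dbench2": owns the journal mutex and waits for the worker to commit. */
static void *dbench2(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&journal_mutex);
	pthread_mutex_lock(&state_lock);
	while (!journal_committed)
		pthread_cond_wait(&state_cond, &state_lock);
	pthread_mutex_unlock(&state_lock);
	pthread_mutex_unlock(&journal_mutex);
	return NULL;
}

/* "kjournald": waits for dbench1's buffer, then commits the journal. */
static void *kjournald(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&state_lock);
	while (!buffer_submitted)
		pthread_cond_wait(&state_cond, &state_lock);
	journal_committed = true;
	pthread_cond_broadcast(&state_cond);
	pthread_mutex_unlock(&state_lock);
	return NULL;
}

int main(void)
{
	bool pull_plug = true;	/* flip to false to reproduce the hang */
	pthread_t t1, t2, t3;

	pthread_create(&t2, NULL, dbench2, NULL);
	usleep(10000);		/* let dbench2 take the journal mutex first */
	pthread_create(&t3, NULL, kjournald, NULL);
	pthread_create(&t1, NULL, dbench1, &pull_plug);

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	pthread_join(t3, NULL);
	printf("all threads finished: no deadlock\n");
	return 0;
}

With pull_plug set to false the three threads wait on each other forever, which is the same dependency cycle the quoted patch breaks by flushing the plug before entering the slow path.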