On 12/12/2013 05:23 AM, Ingo Molnar wrote:
> * John Stultz <john.stultz@xxxxxxxxxx> wrote:
>
>> As part of normal operations, the hrtimer subsystem frequently calls
>> into the timekeeping code, creating a locking order of
>>   hrtimer locks -> timekeeping locks
>>
>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>> between the timekeeping and hrtimer subsystems, so that we could
>> notify the hrtimer subsystem that the time had changed while holding
>> the timekeeping locks. This was done by scheduling delayed work
>> that would run later, once we were out of the timekeeping code.
>>
>> But unfortunately the lock chains are complex enough that in
>> scheduling delayed work, we end up eventually trying to grab
>> an hrtimer lock.
>>
>> Sasha Levin noticed this in testing when the new seqlock lockdep
>> enablement triggered the following (somewhat abbreviated) message:
>>
>> [ 251.100221] ======================================================
>> [ 251.100221] [ INFO: possible circular locking dependency detected ]
>> [ 251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted
>> [ 251.101967] -------------------------------------------------------
>> [ 251.101967] kworker/10:1/4506 is trying to acquire lock:
>> [ 251.101967]  (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70
>> [ 251.101967]
>> [ 251.101967] but task is already holding lock:
>> [ 251.101967]  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
>> [ 251.101967]
>> [ 251.101967] which lock already depends on the new lock.
>> [ 251.101967]
>> [ 251.101967]
>> [ 251.101967] the existing dependency chain (in reverse order) is:
>> [ 251.101967]
>> -> #5 (hrtimer_bases.lock#11){-.-...}:
>> [snipped]
>> -> #4 (&rt_b->rt_runtime_lock){-.-...}:
>> [snipped]
>> -> #3 (&rq->lock){-.-.-.}:
>> [snipped]
>> -> #2 (&p->pi_lock){-.-.-.}:
>> [snipped]
>> -> #1 (&(&pool->lock)->rlock){-.-...}:
>> [ 251.101967]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
>> [ 251.101967]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
>> [ 251.101967]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
>> [ 251.101967]        [<ffffffff84398500>] _raw_spin_lock+0x40/0x80
>> [ 251.101967]        [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0
>> [ 251.101967]        [<ffffffff81154168>] queue_work_on+0x98/0x120
>> [ 251.101967]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
>> [ 251.101967]        [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160
>> [ 251.101967]        [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70
>> [ 251.101967]        [<ffffffff843a4b49>] ia32_sysret+0x0/0x5
>> [ 251.101967]
>> -> #0 (timekeeper_seq){----..}:
>> [snipped]
>> [ 251.101967] other info that might help us debug this:
>> [ 251.101967]
>> [ 251.101967] Chain exists of:
>>   timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11
>>
>> [ 251.101967]  Possible unsafe locking scenario:
>> [ 251.101967]
>> [ 251.101967]        CPU0                    CPU1
>> [ 251.101967]        ----                    ----
>> [ 251.101967]   lock(hrtimer_bases.lock#11);
>> [ 251.101967]                                lock(&rt_b->rt_runtime_lock);
>> [ 251.101967]                                lock(hrtimer_bases.lock#11);
>> [ 251.101967]   lock(timekeeper_seq);
>> [ 251.101967]
>> [ 251.101967]  *** DEADLOCK ***
>> [ 251.101967]
>> [ 251.101967] 3 locks held by kworker/10:1/4506:
>> [ 251.101967] #0:  (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
>> [ 251.101967] #1:  (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
>> [ 251.101967] #2:  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
>> [ 251.101967]
>> [ 251.101967] stack backtrace:
>> [ 251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053
>> [ 251.101967] Workqueue: events clock_was_set_work
>>
>> So the best solution is to avoid calling clock_was_set_delayed() while
>> holding the timekeeping lock, and instead use a flag variable to
>> decide if we should call clock_was_set() once we've released the locks.
>>
>> This works for the case here, where do_adjtimex() was the deadlock
>> trigger point. Unfortunately, in update_wall_time() we still hold
>> the jiffies lock, which would deadlock with the IPI triggered by
>> clock_was_set(), preventing us from calling it even after we drop the
>> timekeeping lock. So instead call clock_was_set_delayed() at that point.
>>
>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>> Cc: Prarit Bhargava <prarit@xxxxxxxxxx>
>> Cc: Richard Cochran <richardcochran@xxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Sasha Levin <sasha.levin@xxxxxxxxxx>
>> Cc: stable <stable@xxxxxxxxxxxxxxx> #3.10+
>> Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx>
>> Signed-off-by: John Stultz <john.stultz@xxxxxxxxxx>
>> ---
>>  kernel/time/timekeeping.c | 18 ++++++++++++++++--
>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 998ec751..c1d36b6 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1278,7 +1278,6 @@ static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
>>
>>  		__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);
>>
>> -		clock_was_set_delayed();
>>  		action = TK_CLOCK_WAS_SET;
>>  	}
>>  }
>> @@ -1440,6 +1439,19 @@ static void update_wall_time(void)
>>  	write_seqcount_end(&timekeeper_seq);
>>  out:
>>  	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
>> +	if (action & TK_CLOCK_WAS_SET) {
>> +		/*
>> +		 * XXX - I'd rather we just call clock_was_set(), but
>> +		 * since we're currently holding the jiffies lock, calling
>> +		 * clock_was_set would trigger an ipi which would then grab
>> +		 * the jiffies lock and we'd deadlock. :(
>> +		 * The right solution should probably be droping
>> +		 * the jiffies lock before calling update_wall_time
>> +		 * but that requires some rework of the tick sched
>> +		 * code.
>
> s/droping/
> dropping
>
>> +		 */
>> +		clock_was_set_delayed();
>
> Hm, this feels like a hack. Is the 'rework of the tick sched code'
> going to happen too?

Yea. It is sort of a hack, since the whole purpose of
clock_was_set_delayed() was that it was supposed to be safe to call
deep in the timekeeping code.

We also need to backport this I-don't-know-how-far-back to -stable,
and that may include kernels where the jiffies lock and timekeeping
lock aren't yet split apart. So I suspect the hack-ish fix with a
following cleanup is the best route.

I'll try to spin up a first draft of the cleanup patch so we can at
least have that discussed and hopefully ready to be queued.

thanks
-john
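
P.S.: For anyone following along, a minimal userspace sketch of the
pattern the patch adopts: record under the lock that notification is
needed, then notify only after the lock is dropped. Pthread mutexes
stand in for the kernel's timekeeper locks here, and all the names
below are illustrative rather than the kernel's:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t timekeeper_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for clock_was_set(): in the kernel this ends up taking
 * hrtimer locks, so it must not run while timekeeper_lock is held. */
static void notify_clock_was_set(void)
{
	printf("notify: clock was set\n");
}

static void adjtimex_style_update(void)
{
	bool clock_set = false;

	pthread_mutex_lock(&timekeeper_lock);
	/* ... update timekeeping state here ... */
	clock_set = true;	/* instead of notifying under the lock */
	pthread_mutex_unlock(&timekeeper_lock);

	/* Safe now: no timekeeping lock is held, so notifying cannot
	 * close a notifier -> timekeeping -> notifier lock cycle. */
	if (clock_set)
		notify_clock_was_set();
}

int main(void)
{
	adjtimex_style_update();
	return 0;
}

Note that update_wall_time() cannot use even this deferred call, since
the jiffies lock is still held at the point where the notification
would fire; that is why it falls back to clock_was_set_delayed() there.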