On 12/12/2013 05:23 AM, Ingo Molnar wrote:
> * John Stultz <john.stultz@xxxxxxxxxx> wrote:
>
>> As part of normal operations, the hrtimer subsystem frequently calls
>> into the timekeeping code, creating a locking order of
>>   hrtimer locks -> timekeeping locks
>>
>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>> between the timekeeping and hrtimer subsystems, so that we could
>> notify the hrtimer subsystem that the time had changed while holding
>> the timekeeping locks. This was done by scheduling delayed work
>> that would run later, once we were out of the timekeeping code.
>>
>> But unfortunately the lock chains are complex enough that in
>> scheduling delayed work, we end up eventually trying to grab
>> an hrtimer lock.
>>
>> Sasha Levin noticed this in testing when the new seqlock lockdep
>> enablement triggered the following (somewhat abbreviated) message:
>>
>> [ 251.100221] ======================================================
>> [ 251.100221] [ INFO: possible circular locking dependency detected ]
>> [ 251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted
>> [ 251.101967] -------------------------------------------------------
>> [ 251.101967] kworker/10:1/4506 is trying to acquire lock:
>> [ 251.101967]  (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70
>> [ 251.101967]
>> [ 251.101967] but task is already holding lock:
>> [ 251.101967]  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
>> [ 251.101967]
>> [ 251.101967] which lock already depends on the new lock.
>> [ 251.101967]
>> [ 251.101967]
>> [ 251.101967] the existing dependency chain (in reverse order) is:
>> [ 251.101967]
>> -> #5 (hrtimer_bases.lock#11){-.-...}:
>> [snipped]
>> -> #4 (&rt_b->rt_runtime_lock){-.-...}:
>> [snipped]
>> -> #3 (&rq->lock){-.-.-.}:
>> [snipped]
>> -> #2 (&p->pi_lock){-.-.-.}:
>> [snipped]
>> -> #1 (&(&pool->lock)->rlock){-.-...}:
>> [ 251.101967]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
>> [ 251.101967]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
>> [ 251.101967]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
>> [ 251.101967]        [<ffffffff84398500>] _raw_spin_lock+0x40/0x80
>> [ 251.101967]        [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0
>> [ 251.101967]        [<ffffffff81154168>] queue_work_on+0x98/0x120
>> [ 251.101967]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
>> [ 251.101967]        [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160
>> [ 251.101967]        [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70
>> [ 251.101967]        [<ffffffff843a4b49>] ia32_sysret+0x0/0x5
>> [ 251.101967]
>> -> #0 (timekeeper_seq){----..}:
>> [snipped]
>> [ 251.101967] other info that might help us debug this:
>> [ 251.101967]
>> [ 251.101967] Chain exists of:
>>   timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11
>>
>> [ 251.101967]  Possible unsafe locking scenario:
>> [ 251.101967]
>> [ 251.101967]        CPU0                    CPU1
>> [ 251.101967]        ----                    ----
>> [ 251.101967]   lock(hrtimer_bases.lock#11);
>> [ 251.101967]                                lock(&rt_b->rt_runtime_lock);
>> [ 251.101967]                                lock(hrtimer_bases.lock#11);
>> [ 251.101967]   lock(timekeeper_seq);
>> [ 251.101967]
>> [ 251.101967]  *** DEADLOCK ***
>> [ 251.101967]
>> [ 251.101967] 3 locks held by kworker/10:1/4506:
>> [ 251.101967] #0:  (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
>> [ 251.101967] #1:  (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
>> [ 251.101967] #2:  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
>> [ 251.101967]
>> [ 251.101967] stack backtrace:
>> [ 251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053
>> [ 251.101967] Workqueue: events clock_was_set_work
>>
>> So the best solution is to avoid calling clock_was_set_delayed() while
>> holding the timekeeping lock, and instead use a flag variable to
>> decide if we should call clock_was_set() once we've released the locks.
>>
>> This works for the case here, where do_adjtimex() was the deadlock
>> trigger point. Unfortunately, in update_wall_time() we still hold
>> the jiffies lock, which would deadlock with the IPI triggered by
>> clock_was_set(), preventing us from calling it even after we drop the
>> timekeeping lock. So instead call clock_was_set_delayed() at that point.
>>
>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>> Cc: Prarit Bhargava <prarit@xxxxxxxxxx>
>> Cc: Richard Cochran <richardcochran@xxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Sasha Levin <sasha.levin@xxxxxxxxxx>
>> Cc: stable <stable@xxxxxxxxxxxxxxx> #3.10+
>> Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx>
>> Signed-off-by: John Stultz <john.stultz@xxxxxxxxxx>
>> ---
>>  kernel/time/timekeeping.c | 18 ++++++++++++++++--
>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 998ec751..c1d36b6 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1278,7 +1278,6 @@ static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
>>
>>  		__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);
>>
>> -		clock_was_set_delayed();
>>  		action = TK_CLOCK_WAS_SET;
>>  	}
>>  }
>> @@ -1440,6 +1439,19 @@ static void update_wall_time(void)
>>  	write_seqcount_end(&timekeeper_seq);
>>  out:
>>  	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
>> +	if (action & TK_CLOCK_WAS_SET) {
>> +		/*
>> +		 * XXX - I'd rather we just call clock_was_set(), but
>> +		 * since we're currently holding the jiffies lock, calling
>> +		 * clock_was_set would trigger an ipi which would then grab
>> +		 * the jiffies lock and we'd deadlock. :(
>> +		 * The right solution should probably be droping
>> +		 * the jiffies lock before calling update_wall_time
>> +		 * but that requires some rework of the tick sched
>> +		 * code.
>
> s/droping/
> dropping
>
>> +		 */
>> +		clock_was_set_delayed();
>
> Hm, this feels like a hack. Is the 'rework of the tick sched code'
> going to happen too?

Yea. It is sort of a hack, since the whole purpose of
clock_was_set_delayed() was that it was supposed to be safe to call
deep in the timekeeping code.

We also need to backport this I-don't-know-how-far-back to -stable,
and that may include kernels where the jiffies lock and timekeeping
lock aren't yet split apart. So I suspect the hack-ish fix with a
following cleanup is the best route.

I'll try to spin up a first draft of the cleanup patch so we can at
least have that discussed and hopefully ready to be queued.

thanks
-john
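
P.S.: For anyone following along, a minimal userspace sketch of the
pattern the patch adopts: record under the lock that notification is
needed, then notify only after the lock is dropped. Pthread mutexes
stand in for the kernel's timekeeper locks here, and all the names
below are illustrative rather than the kernel's:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t timekeeper_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for clock_was_set(): in the kernel this ends up taking
 * hrtimer locks, so it must not run while timekeeper_lock is held. */
static void notify_clock_was_set(void)
{
	printf("notify: clock was set\n");
}

static void adjtimex_style_update(void)
{
	bool clock_set = false;

	pthread_mutex_lock(&timekeeper_lock);
	/* ... update timekeeping state here ... */
	clock_set = true;	/* instead of notifying under the lock */
	pthread_mutex_unlock(&timekeeper_lock);

	/* Safe now: no timekeeping lock is held, so notifying cannot
	 * close a notifier -> timekeeping -> notifier lock cycle. */
	if (clock_set)
		notify_clock_was_set();
}

int main(void)
{
	adjtimex_style_update();
	return 0;
}

Note that update_wall_time() cannot use even this deferred call, since
the jiffies lock is still held at the point where the notification
would fire; that is why it falls back to clock_was_set_delayed() there.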