On 12/03/2014 09:56 AM, Juerg Haefliger wrote:
> Added lots of tracing and here is the sequence of events that lead to
> the deadlock:
>
> mutex A is free
> mutex B is free
>
> eth1: acquires mutex B (rt_spin_lock_nested)
>
> iperf: wants to acquire mutex B (rt_spin_lock)
> iperf: blocks on rtmutex (mutex B, owner eth1)
> iperf: enqueue iperf on mutex B
> iperf: top waiter of mutex B is iperf
> iperf: enqueue_pi iperf on eth1
> iperf: chain_walk = 0
> iperf: for (;;)
> iperf: __try_to_take_rt_mutex(mutex B, owner eth1)
>
> At this point iperf is in a tight loop trying to take mutex B
>
> iperf: receives an smp_apic_timer_interrupt (which yanks it out of
> the previous for (;;) loop)
> iperf: acquires mutex A (rt_spin_trylock)
>
> eth1: wants to acquire mutex A (rt_spin_lock)
> eth1: gets wait_lock of mutex A
> eth1: blocks on rtmutex (mutex A, owner iperf)
>
> iperf: tries to release mutex A (rt_spin_unlock_after_trylock_in_irq)
> iperf: spins on trying to get wait_lock of mutex A (in
> rt_spin_lock_slowunlock_hirq)
>
> eth1: enqueue eth1 on mutex A
> eth1: top waiter of mutex A is eth1
> eth1: enqueue_pi eth1 on iperf
> eth1: adjust prio of iperf from 120 to 49
> eth1: iperf is blocked on mutex B
> eth1: chain_walk = 1
> eth1: releases wait_lock of mutex A
>
> iperf: gets wait_lock of mutex A
> iperf: __rt_spin_lock_slowunlock mutex A
>
> eth1: rt_mutex_adjust_prio_chain(task=iperf, orig_lock=mutex A,
> next_lock=mutex B, orig_waiter=eth1,mutex A, top_task=eth1)
> eth1: waiter is iperf,mutex B
> eth1: lock is mutex B
> eth1: raw_spin_trylock mutex B
> eth1: rt_mutex_owner(mutex B) == eth1 ---> -EDEADLOCK
>
> Does that make sense? Please let me know if I need to provide more
> info. I have detailed ftrace and crash dumps that I could provide if
> need be.
> Is this a 'real' problem or is the above fix (ignore the deadlock)
> sufficient?
>
> Thanks
> ...Juerg

I believe this is a problem with spin_do_trylock()->rt_mutex_trylock()
being called by update_process_times() as a result of the
smp_apic_timer_interrupt, i.e. in interrupt context. rt_mutex_trylock(),
in both the fast and slow paths, uses current (the currently running
task) as the mutex owner. In interrupt context, current is readable but
is not valid for ownership purposes and should not be recorded as the
owner of the mutex: the mutex's real owner is the interrupt handler, not
the interrupted task. Recording current as the owner causes a false
positive in rt_mutex_adjust_prio_chain(): it predicts a deadlock when
there really is none.
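For reference, the trylock fast path is roughly shaped like this (a
simplified sketch, not the exact -rt source; deadlock accounting and
debug variants omitted). The point is that it unconditionally records
current as the new owner, no matter what context it runs in:

	/*
	 * Simplified sketch of the rt_mutex trylock fast path.  The owner
	 * stored in lock->owner is always 'current', even when the trylock
	 * is issued from smp_apic_timer_interrupt, where 'current' is just
	 * whatever task happened to be interrupted (iperf in the trace
	 * above).
	 */
	static inline int
	rt_mutex_fasttrylock(struct rt_mutex *lock,
			     int (*slowfn)(struct rt_mutex *lock))
	{
		/* Fast path: lock is free, cmpxchg owner from NULL to current. */
		if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
			return 1;

		/* The slow path likewise takes the lock on behalf of 'current'. */
		return slowfn(lock);
	}

So when iperf is interrupted and the timer code trylocks mutex A, the
recorded owner is iperf, and eth1's later PI chain walk sees iperf as
both the owner of mutex A and a waiter on mutex B.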
From your scenario above, assuming there is a way to indicate that a
mutex is owned by the interrupt handler rather than the interrupted
iperf, the portion of iperf interrupted by smp_apic_timer_interrupt
would be relabeled smp_apic_timer_interrupt_handler and the scenario
would now read:

iperf: receives an smp_apic_timer_interrupt (which yanks it out of the
previous loop)

smp_apic_timer_interrupt_handler: acquires mutex A (rt_spin_trylock)

eth1: wants to acquire mutex A (rt_spin_lock)
eth1: gets wait_lock of mutex A
eth1: blocks on rtmutex (mutex A, owner smp_apic_timer_interrupt_handler)

smp_apic_timer_interrupt_handler: tries to release mutex A
(rt_spin_unlock_after_trylock_in_irq)
smp_apic_timer_interrupt_handler: spins on trying to get wait_lock of
mutex A (held by eth1)

eth1: enqueue eth1 on mutex A
eth1: top waiter of mutex A is eth1
eth1: enqueue_pi eth1 on smp_apic_timer_interrupt_handler
eth1: adjust prio of smp_apic_timer_interrupt_handler from 120 to 49
eth1: smp_apic_timer_interrupt_handler is not blocked on mutex B
eth1: chain_walk = 0
eth1: releases wait_lock of mutex A

smp_apic_timer_interrupt_handler: gets wait_lock of mutex A
smp_apic_timer_interrupt_handler: __rt_spin_lock_slowunlock mutex A

Instead of:

eth1: rt_mutex_adjust_prio_chain(task=smp_apic_timer_interrupt_handler,
      orig_lock=mutex A, next_lock=mutex B, orig_waiter=eth1,mutex A,
      top_task=eth1)
eth1: waiter is iperf,mutex B
eth1: lock is mutex B
eth1: raw_spin_trylock mutex B
eth1: rt_mutex_owner(mutex B) == eth1 ---> -EDEADLOCK

we would have:

eth1: chain_walk == 0 and next_lock == NULL (instead of chain_walk == 1
      and next_lock == mutex B, since smp_apic_timer_interrupt_handler
      is not actually waiting for mutex B)
eth1: return 0 ---> no deadlock

As mentioned, the fact that rt_mutex_trylock() inaccurately attributes
the ownership of mutex A to iperf causes rt_mutex_adjust_prio_chain()
to inaccurately announce a deadlock. In this situation there is no real
deadlock: smp_apic_timer_interrupt_handler will eventually finish
handling the interrupt, release mutex A and return to the interrupted
iperf, which continues waiting on mutex B. eth1 will eventually get
mutex A, then release both mutex B and mutex A, and iperf will
eventually get mutex B without any deadlock.

At the moment I can think of two ways to fix the problem. One is to
change the mutual exclusion mechanism protecting timer->base to use a
real spinlock (raw_spinlock_t in the RT patch) instead of an rt_mutex.
But it looks like a lot of thought and work has already gone into making
the smp_apic_timer_interrupt handler work with rt_mutex (particularly
the special non-blocking rt_mutex locking path, spin_do_trylock(), that
can safely be used in interrupt context by get_next_timer_interrupt()
and update_process_times()). This approach would require changing
complex logic in kernel/timer.c and would be a high-risk change.

The other approach is to come up with a mechanism to accurately
attribute the mutex's owner to an interrupt handler when the mutex is
acquired in interrupt context, instead of to the interrupted current
task. This would involve changes only in rtmutex.c and is the approach
I'm looking at. What I'm trying to do is: in rt_mutex_fasttrylock() and
rt_mutex_fastunlock(), the two functions callable in interrupt context,
use a dummy owner, (struct task_struct *)~(0xf), instead of current when
called in interrupt context.
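Roughly, the idea is something along these lines (untested sketch; the
helper name and the exact dummy value are placeholders, not a final
patch):

	/*
	 * Untested sketch: in hard interrupt context, record a reserved
	 * dummy owner instead of the interrupted task.  The low bits of
	 * lock->owner are used as flags by rt_mutex, so the dummy value
	 * keeps them clear.
	 */
	#define RT_MUTEX_IRQ_OWNER	((struct task_struct *)~(0xfUL))

	static inline struct task_struct *rt_mutex_fast_owner(void)
	{
		/* In hard irq context 'current' is just the interrupted task. */
		return in_irq() ? RT_MUTEX_IRQ_OWNER : current;
	}

	static inline int
	rt_mutex_fasttrylock(struct rt_mutex *lock,
			     int (*slowfn)(struct rt_mutex *lock))
	{
		struct task_struct *owner = rt_mutex_fast_owner();

		/* Take the lock on behalf of the irq handler, not 'current'. */
		if (likely(rt_mutex_cmpxchg(lock, NULL, owner)))
			return 1;
		return slowfn(lock);
	}

	static inline void
	rt_mutex_fastunlock(struct rt_mutex *lock,
			    void (*slowfn)(struct rt_mutex *lock))
	{
		struct task_struct *owner = rt_mutex_fast_owner();

		/* Release using the same owner value that was recorded. */
		if (likely(rt_mutex_cmpxchg(lock, owner, NULL)))
			return;
		slowfn(lock);
	}

	/*
	 * task_blocks_on_rt_mutex() and rt_mutex_adjust_prio_chain() would
	 * also need to skip the PI enqueue/boost steps when
	 * rt_mutex_owner(lock) is RT_MUTEX_IRQ_OWNER, since there is no
	 * real task behind it to boost or to follow in the chain walk.
	 */

With this, eth1's walk of the PI chain from mutex A stops at the dummy
owner (which is not blocked on anything), so chain_walk stays 0 and no
bogus deadlock is reported; the slow paths then have to recognize the
dummy owner, which is where the additional changes below come in.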
Unfortunately this also involves additional changes to skip work such as
rt_mutex_enqueue_pi() and __rt_mutex_adjust_prio(), etc., when the owner
is the dummy interrupt-handler task, since there is no real task to
boost. Even with all the required changes, this seems to be a lower-risk
and preferable change compared to the first approach.

At the moment I'm still debugging the above changes and will send them
out for comments once I get them to work. Please let me know if you or
anyone else has any questions, comments and/or suggestions. Thanks in
advance.

Thanks,
Mak.