On 07/07/2010 04:57 AM, Thomas Gleixner wrote:
> Cc'ing Darren.
>
> On Wed, 7 Jul 2010, Mike Galbraith wrote:
>
>> Greetings,
>>
>> Stress testing, looking to trigger RCU stalls, I've managed to find a
>> way to repeatably create fireworks.  (got RCU stall, see attached)
>> 4. run it.
>>
>> What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter),
>> this does not make it to consoles (poking sysrq-foo doesn't either).
>> Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
>> explosion, which does make it to consoles.
>>
>> With explosion avoidance, I also see pendowner->pi_blocked_on->task ==
>> NULL at times, but that, as !pendowner->pi_blocked_on, seems to be
>> fallout.  The start of bad juju is always pi_blocked_on != waiter.
>>
>> [ 141.609268] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
>> [ 141.609268] IP: [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
>> [ 141.609268] PGD 20e174067 PUD 20e253067 PMD 0
>> [ 141.609268] Oops: 0000 [#1] PREEMPT SMP
>> [ 141.609268] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
>> [ 141.609268] CPU 0
>> [ 141.609268] Pid: 8154, comm: pthread_cond_ma Tainted: G W 2.6.33.6-rt23 #12 MS-7502/MS-7502
>> [ 141.609268] RIP: 0010:[<ffffffff8106856d>]  [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
>> [ 141.609268] RSP: 0018:ffff88020e3cdd78  EFLAGS: 00010097
>> [ 141.609268] RAX: 0000000000000000 RBX: ffff8801e8eba5c0 RCX: 0000000000000000
>> [ 141.609268] RDX: ffff880028200000 RSI: 0000000000000046 RDI: 0000000000000009
>> [ 141.609268] RBP: ffff88020e3cdda8 R08: 0000000000000002 R09: 0000000000000000
>> [ 141.609268] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff81659068
>> [ 141.609268] R13: ffff8801e8ebdb58 R14: 0000000000000000 R15: ffff8801e8ebac08
>> [ 141.609268] FS:  00007f664d539700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
>> [ 141.609268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 141.609268] CR2: 0000000000000058 CR3: 0000000214266000 CR4: 00000000000006f0
>> [ 141.609268] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 141.609268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 141.609268] Process pthread_cond_ma (pid: 8154, threadinfo ffff88020e3cc000, task ffff88020e2a4700)
>> [ 141.609268] Stack:
>> [ 141.609268]  0000000000000000 ffffffff81659068 0000000000000202 0000000000000000
>> [ 141.609268] <0> 0000000000000000 0000000080001fda ffff88020e3cddc8 ffffffff812fec48
>> [ 141.609268] <0> ffffffff81659068 0000000000606300 ffff88020e3cddd8 ffffffff812ff1b9
>> [ 141.609268] Call Trace:
>> [ 141.609268]  [<ffffffff812fec48>] rt_spin_lock_slowunlock+0x43/0x61
>> [ 141.609268]  [<ffffffff812ff1b9>] rt_spin_unlock+0x46/0x48
>> [ 141.609268]  [<ffffffff81067d7f>] do_futex+0x83c/0x935
>> [ 141.609268]  [<ffffffff810c26ce>] ? handle_mm_fault+0x6de/0x6f1
>> [ 141.609268]  [<ffffffff81067e36>] ? do_futex+0x8f3/0x935
>> [ 141.609268]  [<ffffffff81067fba>] sys_futex+0x142/0x154
>> [ 141.609268]  [<ffffffff81020eb0>] ? do_page_fault+0x23e/0x28e
>> [ 141.609268]  [<ffffffff81004aa7>] ? math_state_restore+0x3d/0x3f
>> [ 141.609268]  [<ffffffff81004b08>] ? do_device_not_available+0xe/0x12
>> [ 141.609268]  [<ffffffff81002c5b>] system_call_fastpath+0x16/0x1b
>> [ 141.609268] Code: c7 09 6d 41 81 e8 ac 34 fd ff 4c 39 ab 70 06 00 00 74 11 be 47 02 00 00 48 c7 c7 09 6d 41 81 e8 92 34 fd ff 48 8b 83 70 06 00 00 <4c> 39 60 58 74 11 be 48 02 00 00 48 c7 c7 09 6d 41 81 e8 74 34
>> [ 141.609268] RIP  [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
>> [ 141.609268]  RSP <ffff88020e3cdd78>
>> [ 141.609268] CR2: 0000000000000058
>> [ 141.609268] ---[ end trace 58805b944e6f93ce ]---
>> [ 141.609268] note: pthread_cond_ma[8154] exited with preempt_count 2
>>
>> (5. eyeball locks.. -> zzzzt -> report -> eyeball..)
>>
>> 	-Mike

OK, here is the final analysis.
After a trace-cmd fix and some rtmutex "paradigm" insight from Rostedt, the
root of the problem came out:

Jul  9 14:26:39 elm9m94 kernel: ------------[ cut here ]------------
Jul  9 14:26:39 elm9m94 kernel: WARNING: at kernel/rtmutex.c:590 wakeup_next_waiter+0x244/0x370()
Jul  9 14:26:39 elm9m94 kernel: Hardware name: IBM eServer BladeCenter HS21 -[7995AC1]-
Jul  9 14:26:39 elm9m94 kernel: Modules linked in: scsi_wait_scan
Jul  9 14:26:39 elm9m94 kernel: Pid: 3870, comm: pthread_cond_ma Not tainted 2.6.33.5-rt23dvh01 #13
Jul  9 14:26:39 elm9m94 kernel: Call Trace:
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8104dc3b>] warn_slowpath_common+0x7b/0xc0
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8104dc94>] warn_slowpath_null+0x14/0x20
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff81083f74>] wakeup_next_waiter+0x244/0x370
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff81921875>] rt_mutex_slowunlock+0x35/0x80
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff819218e9>] rt_mutex_unlock+0x29/0x40
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8108258c>] do_futex+0xa6c/0xae0

# addr2line -e vmlinux ffffffff8108258c
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:833

That is in futex_unlock_pi(), unlocking the pi futex backing the
pthread_mutex:

	rt_mutex_unlock(&pi_state->pi_mutex);

and a trace of the pending owner with the changing waiter (pi_blocked_on)
from sched_show_task():

Jul  9 14:26:39 elm9m94 kernel: pthread_cond_ M ffffffff81922537     0  4176   3745 0x00000080
Jul  9 14:26:39 elm9m94 kernel: ffffffff821fc9a8 0000000000000000 ffff88042ded9c08 ffffffff819216c0
Jul  9 14:26:39 elm9m94 kernel: 0000000000000000 ffff88042dbbabc0 ffffffff821fc9c0 ffff88042decee40
Jul  9 14:26:39 elm9m94 kernel: ffff88042decee40 ffff88042decee40 ffff880400000000 ffff88042ded9b70
Jul  9 14:26:39 elm9m94 kernel: Call Trace:
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff819216c0>] ? rt_spin_lock_slowlock+0x190/0x310
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8192216c>] ? rt_spin_lock+0x7c/0x90
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff810810c8>] ? futex_wait_requeue_pi+0x178/0x360

# addr2line -e vmlinux ffffffff810810c8
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:153

That's inside match_futex(); back up a line:

# addr2line -e vmlinux ffffffff810810c0
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:2258

This is futex_wait_requeue_pi():

	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
	futex_wait_queue_me(hb, &q, to);

	spin_lock(&hb->lock);
	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
	spin_unlock(&hb->lock);

Snippets from the trace:

 pthread_cond_ma-4176  [003]   395.357310: bprint: task_blocks_on_rt_mutex : 4176 blocks on waiter:
 pthread_cond_ma-4176  [003]   395.357310: bprint: task_blocks_on_rt_mutex : waiter: 0xffff88042ded9b68
 pthread_cond_ma-4176  [003]   395.357311: bprint: task_blocks_on_rt_mutex : waiter->lock: 0xffffffff821fc9a8
 pthread_cond_ma-4176  [003]   395.357311: bprint: task_blocks_on_rt_mutex : was: 0xffff88042ded9cc8
                                                                                  ^ should always be NULL
 pthread_cond_ma-4176  [003]   395.357311: bprint: task_blocks_on_rt_mutex : oldwaiter->lock: 0xffff8804104d0450

 pthread_cond_ma-3870  [000]   395.357732: bprint: wakeup_next_waiter : BUG: waiter changed
 pthread_cond_ma-3870  [000]   395.357733: bprint: wakeup_next_waiter : pendowner pid: 4176
 pthread_cond_ma-3870  [000]   395.357734: bprint: wakeup_next_waiter : pendowner comm:
 pthread_cond_ma-3870  [000]   395.357734: bprint: wakeup_next_waiter : waiter->lock: 0xffff8804104d0450
 pthread_cond_ma-3870  [000]   395.357734: bprint: wakeup_next_waiter : pi_blocked_on->lock: 0xffffffff821fc9a8

The waiter it changed to is the one we see in the task_blocks_on_rt_mutex
trace above. The task has woken from futex_wait_queue_me() after having
been requeued to the pi futex (we know this because it wouldn't have a
pi_blocked_on if it were still on the wait-queue futex). Also, the
futex_unlock_pi() call by 3870 is trying to unlock the previous lock that
4176 was blocked on.

The core of the problem is that the proxy lock blocks a task on a lock
the task knows nothing about.
So when it wakes up inside futex_wait_requeue_pi(), it immediately tries
to block on hb->lock to check why it woke up. This has the potential to
block the task on two locks (thus overwriting the pi_blocked_on). Any
attempt at preventing this involves a lock, and ultimately the hb->lock.

The only solution I see is to make the hb->locks raw locks (thanks to
Steven Rostedt for the original idea and for batting this around with me
in IRC). Given the contention the hb->locks can be under, I'm concerned
about the latency impact this can have; still, I don't see another way to
allow for the proxy locking that the rtmutex semantics require for
requeue_pi to work. I'll work up a patch and do some testing to see if I
can get a feel for the impact.

Ugh. I'm still hoping to find some clever way to avoid this; going to map
out this state machine again and see if there is a way around it...

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html