Re: INFO: task hung in __io_uring_register

Jens Axboe <axboe@xxxxxxxxx> · Mon, 15 Apr 2019 10:49:25 -0600

On 4/15/19 7:52 AM, syzbot wrote:
> syzbot has found a reproducer for the following crash on:
> 
> HEAD commit:    dc4060a5 Linux 5.1-rc5
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=17eacddd200000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=856fc6d0fbbeede9
> dashboard link: https://syzkaller.appspot.com/bug?extid=16dc03452dee970a0c3e
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> userspace arch: i386
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14eb47ed200000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=112175dd200000
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+16dc03452dee970a0c3e@xxxxxxxxxxxxxxxxxxxxxxxxx
> 
> INFO: task syz-executor043:8083 blocked for more than 143 seconds.
>        Not tainted 5.1.0-rc5 #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> syz-executor043 D29912  8083   8063 0x20020004
> Call Trace:
>   context_switch kernel/sched/core.c:2877 [inline]
>   __schedule+0x813/0x1cc0 kernel/sched/core.c:3518
>   schedule+0x92/0x180 kernel/sched/core.c:3562
>   schedule_timeout+0x8ca/0xfd0 kernel/time/timer.c:1779
>   do_wait_for_common kernel/sched/completion.c:83 [inline]
>   __wait_for_common kernel/sched/completion.c:104 [inline]
>   wait_for_common kernel/sched/completion.c:115 [inline]
>   wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
>   __io_uring_register+0xb6/0x1fd0 fs/io_uring.c:2929
>   __do_sys_io_uring_register fs/io_uring.c:2979 [inline]
>   __se_sys_io_uring_register fs/io_uring.c:2961 [inline]
>   __ia32_sys_io_uring_register+0x193/0x1f0 fs/io_uring.c:2961
>   do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
>   do_fast_syscall_32+0x281/0xc98 arch/x86/entry/common.c:397
>   entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> RIP: 0023:0xf7f06869
> Code: Bad RIP value.
> RSP: 002b:00000000f7f021ec EFLAGS: 00000296 ORIG_RAX: 00000000000001ab
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000014a
> RBP: 00000000200000c0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> INFO: task syz-executor043:8084 blocked for more than 143 seconds.
>        Not tainted 5.1.0-rc5 #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> syz-executor043 D30320  8084   8063 0x20020004
> Call Trace:
>   context_switch kernel/sched/core.c:2877 [inline]
>   __schedule+0x813/0x1cc0 kernel/sched/core.c:3518
>   schedule+0x92/0x180 kernel/sched/core.c:3562
>   schedule_preempt_disabled+0x13/0x20 kernel/sched/core.c:3620
>   __mutex_lock_common kernel/locking/mutex.c:1002 [inline]
>   __mutex_lock+0x726/0x1310 kernel/locking/mutex.c:1072
>   mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1087
>   __do_sys_io_uring_enter fs/io_uring.c:2678 [inline]
>   __se_sys_io_uring_enter fs/io_uring.c:2637 [inline]
>   __ia32_sys_io_uring_enter+0x67f/0xac0 fs/io_uring.c:2637
>   do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
>   do_fast_syscall_32+0x281/0xc98 arch/x86/entry/common.c:397
>   entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> RIP: 0023:0xf7f06869
> Code: Bad RIP value.
> RSP: 002b:00000000f7ee11ec EFLAGS: 00000292 ORIG_RAX: 00000000000001aa
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000010005
> RDX: 0000000000000002 RSI: 0000000000000003 RDI: 0000000000000000
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 
> Showing all locks held in the system:
> 1 lock held by khungtaskd/1042:
>   #0: 00000000fbdbf33e (rcu_read_lock){....}, at: debug_show_all_locks+0x5f/0x27e kernel/locking/lockdep.c:5056
> 2 locks held by rsyslogd/7952:
>   #0: 000000001bb2c5a7 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0xee/0x110 fs/file.c:801
>   #1: 000000007392ce44 (fs_reclaim){+.+.}, at: is_bpf_text_address+0x0/0x170 kernel/bpf/core.c:697
> 2 locks held by getty/8041:
>   #0: 00000000e7d67802 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
>   #1: 0000000026520570 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8042:
>   #0: 000000004426012c (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
>   #1: 0000000004c02a9e (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8043:
>   #0: 00000000e0209856 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
>   #1: 000000009f921bda (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8044:
>   #0: 0000000000de2ea5 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
>   #1: 000000002674bf52 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8045:
>   #0: 00000000abbba1ef (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
>   #1: 00000000ef32475e (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8046:
>   #0: 000000008d742c83 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
>   #1: 00000000fa255b8a (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8047:
>   #0: 00000000d3d56b2b (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
>   #1: 0000000015242045 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 1 lock held by syz-executor043/8083:
>   #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __do_sys_io_uring_register fs/io_uring.c:2978 [inline]
>   #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __se_sys_io_uring_register fs/io_uring.c:2961 [inline]
>   #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __ia32_sys_io_uring_register+0x182/0x1f0 fs/io_uring.c:2961
> 1 lock held by syz-executor043/8084:
>   #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __do_sys_io_uring_enter fs/io_uring.c:2678 [inline]
>   #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __se_sys_io_uring_enter fs/io_uring.c:2637 [inline]
>   #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __ia32_sys_io_uring_enter+0x67f/0xac0 fs/io_uring.c:2637
> 
> =============================================
> 
> NMI backtrace for cpu 1
> CPU: 1 PID: 1042 Comm: khungtaskd Not tainted 5.1.0-rc5 #68
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>   nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
>   nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
>   arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>   trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
>   check_hung_uninterruptible_tasks kernel/hung_task.c:204 [inline]
>   watchdog+0x9b7/0xec0 kernel/hung_task.c:288
>   kthread+0x357/0x430 kernel/kthread.c:253
>   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
> Sending NMI from CPU 1 to CPUs 0:

This is a deadlock: io_uring_register() waits for existing users of the
ring to exit, but it does so with ctx->uring_lock held, and callers
already inside io_uring_enter() may need that mutex to make progress.
In the traces above, task 8083 holds the mutex and sits in
wait_for_completion(), while task 8084 is blocked in mutex_lock() on
the same mutex and so can never finish and exit.
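
For anyone who wants to see the lock ordering outside the kernel, here is
a rough userspace sketch with pthreads. It is only an analogy, not the
kernel code: every toy_* name is made up for illustration, a POSIX
semaphore stands in for the ctx_done completion, and the user_inside
semaphore exists only to force the interesting interleaving. It assumes
Linux and building with -pthread. toy_register() releases the mutex
around the wait; if the unlock/lock pair is removed, toy_run() hangs in
the same way the report shows.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical stand-in for struct io_ring_ctx; not the kernel layout. */
struct toy_ctx {
	pthread_mutex_t uring_lock;	/* plays ctx->uring_lock */
	sem_t ctx_done;			/* plays the ctx_done completion */
	sem_t user_inside;		/* demo only: a user is in "enter" */
	int work_done;			/* protected by uring_lock */
};

/* sem_wait() can return early with EINTR, so retry in that case. */
static void toy_wait(sem_t *sem)
{
	while (sem_wait(sem) == -1 && errno == EINTR)
		;
}

/* Plays the io_uring_enter() side: needs the mutex before it can exit. */
static void *toy_enter(void *arg)
{
	struct toy_ctx *ctx = arg;

	sem_post(&ctx->user_inside);	/* we now count as an existing user */
	pthread_mutex_lock(&ctx->uring_lock);
	ctx->work_done++;
	pthread_mutex_unlock(&ctx->uring_lock);
	sem_post(&ctx->ctx_done);	/* last user is gone */
	return NULL;
}

/* Plays the register side: entered and left with uring_lock held. */
static void toy_register(struct toy_ctx *ctx)
{
	/* Release the mutex so the existing user can take it and exit. */
	pthread_mutex_unlock(&ctx->uring_lock);
	toy_wait(&ctx->ctx_done);
	pthread_mutex_lock(&ctx->uring_lock);
}

/* Runs one register-vs-enter round; returns how much work "enter" did. */
int toy_run(void)
{
	struct toy_ctx ctx = { .work_done = 0 };
	pthread_t user;
	int done, err;

	pthread_mutex_init(&ctx.uring_lock, NULL);
	sem_init(&ctx.ctx_done, 0, 0);
	sem_init(&ctx.user_inside, 0, 0);

	pthread_mutex_lock(&ctx.uring_lock);	/* as the syscall entry does */
	err = pthread_create(&user, NULL, toy_enter, &ctx);
	assert(err == 0);
	toy_wait(&ctx.user_inside);		/* make sure a user exists */

	toy_register(&ctx);
	done = ctx.work_done;			/* read with the lock held */
	pthread_mutex_unlock(&ctx.uring_lock);

	pthread_join(user, NULL);
	sem_destroy(&ctx.user_inside);
	sem_destroy(&ctx.ctx_done);
	pthread_mutex_destroy(&ctx.uring_lock);
	return done;
}
```

Calling toy_run() from a small main() should return once the "enter"
thread has done its work under the mutex and signalled ctx_done. Unlike
the kernel, nothing in this sketch keeps new users out while the mutex
is released; there is just the one thread.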

This should fix it, by dropping the mutex while we wait for the
references to go away.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f4ddb9d23241..b35300e4c9a7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2929,11 +2929,23 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 
 static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			       void __user *arg, unsigned nr_args)
+	__releases(ctx->uring_lock)
+	__acquires(ctx->uring_lock)
 {
 	int ret;
 
 	percpu_ref_kill(&ctx->refs);
+
+	/*
+	 * Drop uring mutex before waiting for references to exit. If another
+	 * thread is currently inside io_uring_enter() it might need to grab
+	 * the uring_lock to make progress. If we hold it here across the drain
+	 * wait, then we can deadlock. It's safe to drop the mutex here, since
+	 * no new references will come in after we've killed the percpu ref.
+	 */
+	mutex_unlock(&ctx->uring_lock);
 	wait_for_completion(&ctx->ctx_done);
+	mutex_lock(&ctx->uring_lock);
 
 	switch (opcode) {
 	case IORING_REGISTER_BUFFERS:

-- 
Jens Axboe