On Sat, 9 Jul 2022 12:19:56 -0400 Paul Gortmaker wrote:
> On 08/07/2022 (Fri 17:27) John Keeping wrote:
>
> > With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> > normal priority tasks (SCHED_OTHER, nice level zero):
> >
> >   INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> >         Not tainted 5.15.49-rt46 #1
> >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >   task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
> >   Workqueue: writeback wb_workfn (flush-7:0)
>
> This is great.  I was on-and-off looking at an ext4 deadlock triggered
> by LTP's "ftest01" - and was just putting my notes together this morning
> to clarify for myself what I thought was going on before I forgot.
>
> Which is good, because I then could recognize the very similar patterns
> in what you saw to what I'd observed/deduced.  Otherwise I'd probably not
> have had your backtraces grab my attention at all.
>
> Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
> LTP/ext4 deadlock issue.  So...
>
> Tested-by: Paul Gortmaker <paul.gortmaker@xxxxxxxxxxxxx>
>
> As I'd already put together my ext4 debug info, I'll include it below
> in case it helps future people searching for LTP and ftest01.
>
> Thanks,
> Paul.
>
> -----------------
>
> There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP
> "ftest01".
>
> Reproducer details are at the bottom.  What seems to happen is as follows:
>
> The kworker doing writeback wb_workfn (pid 213 below) stalls on the
> one and only down_write in ext4_map_blocks() and never moves again.
>
>	/*
>	 * New blocks allocate and/or writing to unwritten extent
>	 * will possibly result in updating i_data, so we take
>	 * the write lock of i_data_sem, and call get_block()
>	 * with create == 1 flag.
>	 */
>	down_write(&EXT4_I(inode)->i_data_sem);
>
> So who is blocking the kworker?  There are two of the ftest01 processes
> (pid 818 and pid 821) doing ext4 truncate which look like this:
>
>	ext4_truncate+0x46a/0x660
>	--> ext4_ext_truncate+0xd6/0xf0
>	    --> ext4_ext_remove_space+0x59f/0x1830
>	        --> jbd2_journal_get_write_access+0x51/0x80
>	            --> do_get_write_access+0x2ad/0x550
>	                --> complete_all+0x60/0x60
>	                    --> bit_wait_io+0x11/0x60
>
> They too are stalled in D state waiting there forever - presumably
> waiting on the kworker to make the progress they are waiting for.
>
> Here is where I think lockdep shows us the problem:
>
>  5 locks held by ftest01/818:
>  [...]
>  #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
>
>  5 locks held by ftest01/821:
>  [...]
>  #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
>
> Yes - that is the one the kworker was trying the down_write() on.
>
> Also note that the bit_wait is a big telltale sign here - there is only
> one in jbd2's do_get_write_access() function:
>
>	/*
>	 * There is one case we have to be very careful about.  If the
>	 * committing transaction is currently writing this buffer out to disk
>	 * and has NOT made a copy-out, then we cannot modify the buffer
>	 * contents at all right now.  The essence of copy-out is that it is
>	 * the extra copy, not the primary copy, which gets journaled.  If the
>	 * primary copy is already going to disk then we cannot do copy-out
>	 * here.
>	 */
>	if (buffer_shadow(bh)) {
>		JBUFFER_TRACE(jh, "on shadow: sleep");
>		spin_unlock(&jh->b_state_lock);
>		wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
>		goto repeat;
>	}
>
> Presumably this is a corner case that preempt-rt makes easier to trigger?
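
The other end of that BH_Shadow wait is, as far as I can tell, the journal
I/O completion path.  Paraphrased from memory, so treat it as a sketch rather
than a verbatim quote:

	/* journal_end_buffer_io_sync(), roughly - orig_bh is the shadowed buffer */
	clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
	smp_mb__after_atomic();
	wake_up_bit(&orig_bh->b_state, BH_Shadow);

so a task sleeping in that wait_on_bit_io() only makes progress once the
committing transaction's copy of the buffer has been written out.
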
> The full trace with interspersed lockdep info follows.
>
> Various detailed reproducer information follows that.
>
> -----------------------
>
> sysrq-w backtrace information interspersed with sysrq-d lockdep info
> on 5.19-rc2-rt3 kernel:
>
> task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000
> Workqueue: writeback wb_workfn (flush-7:0)
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
>  schedule+0x48/0xc0
>  __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
>  rwbase_write_lock+0x98/0x6e0
>  down_write+0x46/0x50
>  ? ext4_map_blocks+0x156/0x5c0
>  ext4_map_blocks+0x156/0x5c0
>  ext4_writepages+0xa1a/0x1290
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? lockdep_hardirqs_on+0xcd/0x150
>  do_writepages+0xd2/0x1a0
>  __writeback_single_inode+0x64/0x850
>  writeback_sb_inodes+0x22d/0x5b0
>  __writeback_inodes_wb+0x67/0xe0
>  wb_writeback+0x269/0x5f0
>  ? get_nr_inodes+0x49/0x70
>  wb_workfn+0x43d/0x780
>  ? __this_cpu_preempt_check+0x13/0x20
>  process_one_work+0x286/0x710
>  worker_thread+0x3c/0x3d0
>  ? process_one_work+0x710/0x710
>  kthread+0x13b/0x150
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
>
> 6 locks held by kworker/u129:34/213:
>  #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
>  #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
>  #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
>  #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
>  #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>  #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0
>
>
> task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  schedule+0x48/0xc0
>  jbd2_journal_wait_updates+0x7c/0xf0
>  ? schedstat_stop+0x10/0x10
>  jbd2_journal_commit_transaction+0x284/0x20e0
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? find_held_lock+0x35/0xa0
>  ? _raw_spin_unlock_irqrestore+0x86/0x8f
>  ? _raw_spin_unlock_irqrestore+0x86/0x8f
>  ? try_to_del_timer_sync+0x53/0x80
>  ? trace_hardirqs_on+0x3b/0x120
>  ? try_to_del_timer_sync+0x53/0x80
>  kjournald2+0xd5/0x2b0
>  ? schedstat_stop+0x10/0x10
>  ? commit_timeout+0x20/0x20
>  kthread+0x13b/0x150
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
>
>
> task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __x64_sys_tee+0xd0/0xd0
>  ? __x64_sys_tee+0xd0/0xd0
>  schedule+0x48/0xc0
>  wb_wait_for_completion+0x62/0x90
>  ? schedstat_stop+0x10/0x10
>  sync_inodes_sb+0xdd/0x460
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __x64_sys_tee+0xd0/0xd0
>  sync_inodes_one_sb+0x15/0x20
>  iterate_supers+0x94/0x100
>  ksys_sync+0x42/0xa0
>  __do_sys_sync+0xe/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae2cdb
> RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
> RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
> RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
> RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
>  </TASK>
>
> 1 lock held by ftest01/817:
>  #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
>
>
> task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? bit_wait+0x60/0x60
>  schedule+0x48/0xc0
>  io_schedule+0x16/0x40
>  bit_wait_io+0x11/0x60
>  __wait_on_bit_lock+0x5e/0xb0
>  out_of_line_wait_on_bit_lock+0x93/0xb0
>  ? complete_all+0x60/0x60
>  __lock_buffer+0x3f/0x50

pid: 818 was locking buffer

>  do_get_write_access+0x2ad/0x550
>  jbd2_journal_get_write_access+0x51/0x80
>  __ext4_journal_get_write_access+0xf9/0x1a0
>  ext4_ext_get_access.isra.32+0x34/0x50
>  ext4_ext_remove_space+0x59f/0x1830
>  ? ext4_ext_truncate+0xa2/0xf0
>  ? ext4_ext_truncate+0x5e/0xf0
>  ext4_ext_truncate+0xd6/0xf0
>  ext4_truncate+0x46a/0x660
>  ext4_setattr+0x6db/0xa40
>  notify_change+0x3e5/0x540
>  do_truncate+0x7b/0xc0
>  ? do_truncate+0x7b/0xc0
>  do_sys_ftruncate+0x164/0x2f0
>  __x64_sys_ftruncate+0x1b/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae3f7b
> RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
> RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
> RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
> R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
>  </TASK>
>
> 5 locks held by ftest01/818:
>  #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
>  #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
>  #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
>  #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>  #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
>
>
> task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __x64_sys_tee+0xd0/0xd0
>  ? __x64_sys_tee+0xd0/0xd0
>  schedule+0x48/0xc0
>  wb_wait_for_completion+0x62/0x90
>  ? schedstat_stop+0x10/0x10
>  sync_inodes_sb+0xdd/0x460
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __x64_sys_tee+0xd0/0xd0
>  sync_inodes_one_sb+0x15/0x20
>  iterate_supers+0x94/0x100
>  ksys_sync+0x42/0xa0
>  __do_sys_sync+0xe/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae2cdb
> RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
> RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
> RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
> RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
> R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
>  </TASK>
>
> 1 lock held by ftest01/819:
>  #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
>
>
> task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __x64_sys_tee+0xd0/0xd0
>  ? __x64_sys_tee+0xd0/0xd0
>  schedule+0x48/0xc0
>  wb_wait_for_completion+0x62/0x90
>  ? schedstat_stop+0x10/0x10
>  sync_inodes_sb+0xdd/0x460
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __x64_sys_tee+0xd0/0xd0
>  sync_inodes_one_sb+0x15/0x20
>  iterate_supers+0x94/0x100
>  ksys_sync+0x42/0xa0
>  __do_sys_sync+0xe/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae2cdb
> RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
> RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
> RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
> RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
> R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
>  </TASK>
>
> 1 lock held by ftest01/820:
>  #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
>
>
> task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? bit_wait+0x60/0x60
>  schedule+0x48/0xc0
>  io_schedule+0x16/0x40
>  bit_wait_io+0x11/0x60
>  __wait_on_bit_lock+0x5e/0xb0
>  out_of_line_wait_on_bit_lock+0x93/0xb0
>  ? complete_all+0x60/0x60
>  __lock_buffer+0x3f/0x50

So was pid: 821.

>  do_get_write_access+0x2ad/0x550
>  jbd2_journal_get_write_access+0x51/0x80
>  __ext4_journal_get_write_access+0xf9/0x1a0
>  ext4_ext_get_access.isra.32+0x34/0x50
>  ext4_ext_remove_space+0x59f/0x1830
>  ? ext4_ext_truncate+0xa2/0xf0
>  ? ext4_ext_truncate+0x5e/0xf0
>  ext4_ext_truncate+0xd6/0xf0
>  ext4_truncate+0x46a/0x660
>  ext4_setattr+0x6db/0xa40
>  notify_change+0x3e5/0x540
>  do_truncate+0x7b/0xc0
>  ? do_truncate+0x7b/0xc0
>  do_sys_ftruncate+0x164/0x2f0
>  __x64_sys_ftruncate+0x1b/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae3f7b
> RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
> RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
> RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
> R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
>  </TASK>
>
> 5 locks held by ftest01/821:
>  #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
>  #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
>  #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
>  #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>  #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

Looks like an ABBA deadlock that should be fed to lockdep, given what was
blocking the flusher (pid 213).

> --------------------------------------
>
> Reproducer:
>
> Get latest preempt-rt kernel:
>     git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
>     checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
>     make defconfig
>     manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
>     make oldconfig and take defaults for other options hidden behind EXPERT
>     can optionally enable LOCKDEP options so sysrq-d works.
>
> Get LTP; build + install
>     https://github.com/linux-test-project/ltp.git
>
> Boot and then run reproducer:
>     dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
>     mkfs.ext4 -F /root/tmp_mount_file
>     mount /root/tmp_mount_file /mnt/
>     /opt/ltp/runltp -f fs -s ftest01 -d /mnt/
>     umount /mnt
>
> The loop device isn't strictly necessary, but it contains the fs fallout
> when the inevitable reboot is required to get rid of the D state processes.
>
> Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
> Test hangs forever on v5.15-rt and latest v5.19-rc-rt.  Test threads are
> in D state - any other i/o directed at /mnt also hangs in D state.

Make lock_buffer have mutex_lock() semantics in an attempt to catch the
deadlock above.  Only for thoughts now.
Hillf

--- mainline/fs/buffer.c
+++ b/fs/buffer.c
@@ -66,12 +66,17 @@ EXPORT_SYMBOL(touch_buffer);
 
 void __lock_buffer(struct buffer_head *bh)
 {
+	/* lock_buffer has mutex_lock() semantics here */
+	mutex_acquire(&bh->b_dep_map, 0, 0, _RET_IP_);
+
 	wait_on_bit_lock_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_buffer);
 
 void unlock_buffer(struct buffer_head *bh)
 {
+	mutex_release(&bh->b_dep_map, _RET_IP_);
+
 	clear_bit_unlock(BH_Lock, &bh->b_state);
 	smp_mb__after_atomic();
 	wake_up_bit(&bh->b_state, BH_Lock);
@@ -3289,12 +3294,22 @@ static void recalc_bh_state(void)
 	buffer_heads_over_limit = (tot > max_buffer_heads);
 }
 
+static void buffer_head_init_lockdep(struct buffer_head *buf)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key bh_key;
+
+	lockdep_init_map_wait(&buf->b_dep_map, "buffer_head_lock", &bh_key, 0, LD_WAIT_SLEEP);
+#endif
+}
+
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
 	struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
 	if (ret) {
 		INIT_LIST_HEAD(&ret->b_assoc_buffers);
 		spin_lock_init(&ret->b_uptodate_lock);
+		buffer_head_init_lockdep(ret);
 		preempt_disable();
 		__this_cpu_inc(bh_accounting.nr);
 		recalc_bh_state();
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -76,6 +76,9 @@ struct buffer_head {
 	spinlock_t b_uptodate_lock;	/* Used by the first bh in a page, to
 					 * serialise IO completion of other
 					 * buffers in the page */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map b_dep_map;
+#endif
 };
 
 /*
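
If we go this way, the trylock fast path likely needs a matching annotation
too - with the diff above, an uncontended lock_buffer() takes the bit via
trylock_buffer() without ever calling mutex_acquire(), while unlock_buffer()
always calls mutex_release(), so lockdep should report an unlock imbalance on
the first uncontended buffer.  A minimal sketch of the companion change
(untested, only to illustrate the idea; b_dep_map is the field added by the
diff above):

	/*
	 * Hypothetical companion change, mirroring trylock_buffer() in
	 * buffer_head.h: record a trylock-style acquisition so that every
	 * successfully taken buffer lock pairs with the mutex_release()
	 * in unlock_buffer().
	 */
	static inline int trylock_buffer(struct buffer_head *bh)
	{
		int locked = likely(!test_and_set_bit_lock(BH_Lock, &bh->b_state));

		if (locked)
			mutex_acquire(&bh->b_dep_map, 0, 1, _RET_IP_);
		return locked;
	}

The I/O completion paths (end_buffer_write_sync() and friends) also call
unlock_buffer() from a context other than the task that locked the buffer,
so a plain mutex-style annotation presumably needs some care there as well.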