On Sat, 9 Jul 2022 12:19:56 -0400 Paul Gortmaker wrote:
> On 08/07/2022 (Fri 17:27) John Keeping wrote:
>
> > With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> > normal priority tasks (SCHED_OTHER, nice level zero):
> >
> >   INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> >         Not tainted 5.15.49-rt46 #1
> >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >   task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
> >   Workqueue: writeback wb_workfn (flush-7:0)
>
> This is great.  I was on-and-off looking at an ext4 deadlock triggered
> by LTP's "ftest01" - and was just putting my notes together this morning
> to clarify for myself what I thought was going on before I forgot.
>
> Which is good, because I then could recognize the very similar patterns
> in what you saw to what I'd observed/deduced.  Otherwise I'd probably not
> have had your backtraces grab my attention at all.
>
> Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
> LTP/ext4 deadlock issue.  So...
>
> Tested-by: Paul Gortmaker <paul.gortmaker@xxxxxxxxxxxxx>
>
> As I'd already put together my ext4 debug info, I'll include it below
> in case it helps future people searching for LTP and ftest01.
>
> Thanks,
> Paul.
>
> -----------------
>
> There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP
> "ftest01".
>
> Reproducer details are at the bottom.  What seems to happen is as follows:
>
> The kworker doing writeback wb_workfn (pid 213 below) stalls on the
> one and only down_write in ext4_map_blocks() and never moves again.
>
>	/*
>	 * New blocks allocate and/or writing to unwritten extent
>	 * will possibly result in updating i_data, so we take
>	 * the write lock of i_data_sem, and call get_block()
>	 * with create == 1 flag.
>	 */
>	down_write(&EXT4_I(inode)->i_data_sem);
>
> So who is blocking the kworker?  There are two of the ftest01 processes
> (pid 818 and pid 821) doing ext4 truncate which look like this:
>
>	ext4_truncate+0x46a/0x660
>	--> ext4_ext_truncate+0xd6/0xf0
>	    --> ext4_ext_remove_space+0x59f/0x1830
>	        --> jbd2_journal_get_write_access+0x51/0x80
>	            --> do_get_write_access+0x2ad/0x550
>	                --> complete_all+0x60/0x60
>	                    --> bit_wait_io+0x11/0x60
>
> They too are stalled in D state waiting there forever - presumably
> waiting on the kworker to make the progress they are waiting for.
>
> Here is where I think lockdep shows us the problem:
>
>  5 locks held by ftest01/818:
>  [...]
>  #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
>
>  5 locks held by ftest01/821:
>  [...]
>  #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
>
> Yes - that is the one the kworker was trying the down_write() on.
>
> Also note that the bit_wait is a big telltale sign here - there is only
> one in jbd2's do_get_write_access() function:
>
>	/*
>	 * There is one case we have to be very careful about.  If the
>	 * committing transaction is currently writing this buffer out to disk
>	 * and has NOT made a copy-out, then we cannot modify the buffer
>	 * contents at all right now.  The essence of copy-out is that it is
>	 * the extra copy, not the primary copy, which gets journaled.  If the
>	 * primary copy is already going to disk then we cannot do copy-out
>	 * here.
>	 */
>	if (buffer_shadow(bh)) {
>		JBUFFER_TRACE(jh, "on shadow: sleep");
>		spin_unlock(&jh->b_state_lock);
>		wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
>		goto repeat;
>	}
>
> Presumably this is a corner case that preempt-rt makes easier to trigger?
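
The other end of that BH_Shadow wait is, as far as I can tell, the journal
I/O completion path.  Paraphrased from memory, so treat it as a sketch rather
than a verbatim quote:

	/* journal_end_buffer_io_sync(), roughly - orig_bh is the shadowed buffer */
	clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
	smp_mb__after_atomic();
	wake_up_bit(&orig_bh->b_state, BH_Shadow);

so a task sleeping in that wait_on_bit_io() only makes progress once the
committing transaction's copy of the buffer has been written out.
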
> The full trace with interspersed lockdep info follows.
>
> Various detailed reproducer information follows that.
>
> -----------------------
>
> sysrq-w backtrace information interspersed with sysrq-d lockdep info
> on 5.19-rc2-rt3 kernel:
>
> task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000
> Workqueue: writeback wb_workfn (flush-7:0)
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
>  schedule+0x48/0xc0
>  __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
>  rwbase_write_lock+0x98/0x6e0
>  down_write+0x46/0x50
>  ? ext4_map_blocks+0x156/0x5c0
>  ext4_map_blocks+0x156/0x5c0
>  ext4_writepages+0xa1a/0x1290
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? lockdep_hardirqs_on+0xcd/0x150
>  do_writepages+0xd2/0x1a0
>  __writeback_single_inode+0x64/0x850
>  writeback_sb_inodes+0x22d/0x5b0
>  __writeback_inodes_wb+0x67/0xe0
>  wb_writeback+0x269/0x5f0
>  ? get_nr_inodes+0x49/0x70
>  wb_workfn+0x43d/0x780
>  ? __this_cpu_preempt_check+0x13/0x20
>  process_one_work+0x286/0x710
>  worker_thread+0x3c/0x3d0
>  ? process_one_work+0x710/0x710
>  kthread+0x13b/0x150
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
>
> 6 locks held by kworker/u129:34/213:
>  #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
>  #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
>  #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
>  #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
>  #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>  #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0
>
>
> task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  schedule+0x48/0xc0
>  jbd2_journal_wait_updates+0x7c/0xf0
>  ? schedstat_stop+0x10/0x10
>  jbd2_journal_commit_transaction+0x284/0x20e0
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? find_held_lock+0x35/0xa0
>  ? _raw_spin_unlock_irqrestore+0x86/0x8f
>  ? _raw_spin_unlock_irqrestore+0x86/0x8f
>  ? try_to_del_timer_sync+0x53/0x80
>  ? trace_hardirqs_on+0x3b/0x120
>  ? try_to_del_timer_sync+0x53/0x80
>  kjournald2+0xd5/0x2b0
>  ? schedstat_stop+0x10/0x10
>  ? commit_timeout+0x20/0x20
>  kthread+0x13b/0x150
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
>
>
> task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __x64_sys_tee+0xd0/0xd0
>  ? __x64_sys_tee+0xd0/0xd0
>  schedule+0x48/0xc0
>  wb_wait_for_completion+0x62/0x90
>  ? schedstat_stop+0x10/0x10
>  sync_inodes_sb+0xdd/0x460
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __x64_sys_tee+0xd0/0xd0
>  sync_inodes_one_sb+0x15/0x20
>  iterate_supers+0x94/0x100
>  ksys_sync+0x42/0xa0
>  __do_sys_sync+0xe/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae2cdb
> RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
> RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
> RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
> RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
>  </TASK>
>
> 1 lock held by ftest01/817:
>  #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
>
>
> task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? bit_wait+0x60/0x60
>  schedule+0x48/0xc0
>  io_schedule+0x16/0x40
>  bit_wait_io+0x11/0x60
>  __wait_on_bit_lock+0x5e/0xb0
>  out_of_line_wait_on_bit_lock+0x93/0xb0
>  ? complete_all+0x60/0x60
>  __lock_buffer+0x3f/0x50

pid: 818 was locking buffer

>  do_get_write_access+0x2ad/0x550
>  jbd2_journal_get_write_access+0x51/0x80
>  __ext4_journal_get_write_access+0xf9/0x1a0
>  ext4_ext_get_access.isra.32+0x34/0x50
>  ext4_ext_remove_space+0x59f/0x1830
>  ? ext4_ext_truncate+0xa2/0xf0
>  ? ext4_ext_truncate+0x5e/0xf0
>  ext4_ext_truncate+0xd6/0xf0
>  ext4_truncate+0x46a/0x660
>  ext4_setattr+0x6db/0xa40
>  notify_change+0x3e5/0x540
>  do_truncate+0x7b/0xc0
>  ? do_truncate+0x7b/0xc0
>  do_sys_ftruncate+0x164/0x2f0
>  __x64_sys_ftruncate+0x1b/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae3f7b
> RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
> RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
> RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
> R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
>  </TASK>
>
> 5 locks held by ftest01/818:
>  #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
>  #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
>  #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
>  #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>  #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
>
>
> task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __x64_sys_tee+0xd0/0xd0
>  ? __x64_sys_tee+0xd0/0xd0
>  schedule+0x48/0xc0
>  wb_wait_for_completion+0x62/0x90
>  ? schedstat_stop+0x10/0x10
>  sync_inodes_sb+0xdd/0x460
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __x64_sys_tee+0xd0/0xd0
>  sync_inodes_one_sb+0x15/0x20
>  iterate_supers+0x94/0x100
>  ksys_sync+0x42/0xa0
>  __do_sys_sync+0xe/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae2cdb
> RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
> RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
> RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
> RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
> R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
>  </TASK>
>
> 1 lock held by ftest01/819:
>  #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
>
>
> task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? __x64_sys_tee+0xd0/0xd0
>  ? __x64_sys_tee+0xd0/0xd0
>  schedule+0x48/0xc0
>  wb_wait_for_completion+0x62/0x90
>  ? schedstat_stop+0x10/0x10
>  sync_inodes_sb+0xdd/0x460
>  ? __this_cpu_preempt_check+0x13/0x20
>  ? __x64_sys_tee+0xd0/0xd0
>  sync_inodes_one_sb+0x15/0x20
>  iterate_supers+0x94/0x100
>  ksys_sync+0x42/0xa0
>  __do_sys_sync+0xe/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae2cdb
> RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
> RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
> RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
> RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
> R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
>  </TASK>
>
> 1 lock held by ftest01/820:
>  #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
>
>
> task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002
> Call Trace:
>  <TASK>
>  __schedule+0x329/0xae0
>  ? bit_wait+0x60/0x60
>  schedule+0x48/0xc0
>  io_schedule+0x16/0x40
>  bit_wait_io+0x11/0x60
>  __wait_on_bit_lock+0x5e/0xb0
>  out_of_line_wait_on_bit_lock+0x93/0xb0
>  ? complete_all+0x60/0x60
>  __lock_buffer+0x3f/0x50

So was pid: 821.

>  do_get_write_access+0x2ad/0x550
>  jbd2_journal_get_write_access+0x51/0x80
>  __ext4_journal_get_write_access+0xf9/0x1a0
>  ext4_ext_get_access.isra.32+0x34/0x50
>  ext4_ext_remove_space+0x59f/0x1830
>  ? ext4_ext_truncate+0xa2/0xf0
>  ? ext4_ext_truncate+0x5e/0xf0
>  ext4_ext_truncate+0xd6/0xf0
>  ext4_truncate+0x46a/0x660
>  ext4_setattr+0x6db/0xa40
>  notify_change+0x3e5/0x540
>  do_truncate+0x7b/0xc0
>  ? do_truncate+0x7b/0xc0
>  do_sys_ftruncate+0x164/0x2f0
>  __x64_sys_ftruncate+0x1b/0x20
>  do_syscall_64+0x3f/0x90
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7f3f9dae3f7b
> RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
> RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
> RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
> R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
> R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
>  </TASK>
>
> 5 locks held by ftest01/821:
>  #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
>  #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
>  #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
>  #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>  #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

Looks like an ABBA deadlock that should be fed to lockdep, given what was
blocking the flusher (pid 213).

> --------------------------------------
>
> Reproducer:
>
> Get latest preempt-rt kernel:
>     git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
>     checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
>     make defconfig
>     manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
>     make oldconfig and take defaults for other options hidden behind EXPERT
>     can optionally enable LOCKDEP options so sysrq-d works.
>
> Get LTP; build + install
>     https://github.com/linux-test-project/ltp.git
>
> Boot and then run reproducer:
>     dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
>     mkfs.ext4 -F /root/tmp_mount_file
>     mount /root/tmp_mount_file /mnt/
>     /opt/ltp/runltp -f fs -s ftest01 -d /mnt/
>     umount /mnt
>
> The loop device isn't strictly necessary, but it contains the fs fallout
> when the inevitable reboot is required to get rid of the D state processes.
>
> Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
> Test hangs forever on v5.15-rt and latest v5.19-rc-rt.  Test threads are
> in D state - any other i/o directed at /mnt also hangs in D state.

Make lock_buffer have mutex_lock() semantics in an attempt to catch the
deadlock above.  Only for thoughts now.
Hillf

--- mainline/fs/buffer.c
+++ b/fs/buffer.c
@@ -66,12 +66,17 @@ EXPORT_SYMBOL(touch_buffer);
 
 void __lock_buffer(struct buffer_head *bh)
 {
+	/* lock_buffer has mutex_lock() semantics here */
+	mutex_acquire(&bh->b_dep_map, 0, 0, _RET_IP_);
+
 	wait_on_bit_lock_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_buffer);
 
 void unlock_buffer(struct buffer_head *bh)
 {
+	mutex_release(&bh->b_dep_map, _RET_IP_);
+
 	clear_bit_unlock(BH_Lock, &bh->b_state);
 	smp_mb__after_atomic();
 	wake_up_bit(&bh->b_state, BH_Lock);
@@ -3289,12 +3294,22 @@ static void recalc_bh_state(void)
 	buffer_heads_over_limit = (tot > max_buffer_heads);
 }
 
+static void buffer_head_init_lockdep(struct buffer_head *buf)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key bh_key;
+
+	lockdep_init_map_wait(&buf->b_dep_map, "buffer_head_lock", &bh_key, 0, LD_WAIT_SLEEP);
+#endif
+}
+
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
 	struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
 	if (ret) {
 		INIT_LIST_HEAD(&ret->b_assoc_buffers);
 		spin_lock_init(&ret->b_uptodate_lock);
+		buffer_head_init_lockdep(ret);
 		preempt_disable();
 		__this_cpu_inc(bh_accounting.nr);
 		recalc_bh_state();
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -76,6 +76,9 @@ struct buffer_head {
 	spinlock_t b_uptodate_lock;	/* Used by the first bh in a page, to
 					 * serialise IO completion of other
 					 * buffers in the page */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map b_dep_map;
+#endif
 };
 
 /*
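
If we go this way, the trylock fast path likely needs a matching annotation
too - with the diff above, an uncontended lock_buffer() takes the bit via
trylock_buffer() without ever calling mutex_acquire(), while unlock_buffer()
always calls mutex_release(), so lockdep should report an unlock imbalance on
the first uncontended buffer.  A minimal sketch of the companion change
(untested, only to illustrate the idea; b_dep_map is the field added by the
diff above):

	/*
	 * Hypothetical companion change, mirroring trylock_buffer() in
	 * buffer_head.h: record a trylock-style acquisition so that every
	 * successfully taken buffer lock pairs with the mutex_release()
	 * in unlock_buffer().
	 */
	static inline int trylock_buffer(struct buffer_head *bh)
	{
		int locked = likely(!test_and_set_bit_lock(BH_Lock, &bh->b_state));

		if (locked)
			mutex_acquire(&bh->b_dep_map, 0, 1, _RET_IP_);
		return locked;
	}

The I/O completion paths (end_buffer_write_sync() and friends) also call
unlock_buffer() from a context other than the task that locked the buffer,
so a plain mutex-style annotation presumably needs some care there as well.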