Adding David Howells in case he has opinions. On Sun, Jan 26, 2025 at 1:37 AM Wang Zhaolong <wangzhaolong1@xxxxxxxxxx> wrote: > > In the code of the LTS branch that is being maintained (from linux-5.4 to > linux-6.6), a deadlock occurs in the network reconnection scenario When > multiple processes or threads write to the same file concurrently. > > Take the code of linux-5.10 as an example. The simplified deadlock process > is as follows: > > ``` > Process 1 Process 2 > lock_page() - [1] > wait_on_page_writeback() - [2] Waiting for writeback, blocked by [4] > > lock_page() - [3] Blocked by [1] > end_page_writeback() - [4] Won't execute > ``` > > Based on my research, I'm going to use two detailed scenarios to illustrate > the issue. > > Scenarios 1: > > ``` > P1 (dd) P2 (cifsd) P3 (cifsiod) > > cifs_writepages > wdata_prepare_pages > lock_page - [1] > wait_on_page_writeback - [2] Waiting for writeback, blocked by [4] > wait_on_page_bit > cifs_demultiplex_thread > cifs_read_from_socket > cifs_readv_from_socket > - If another process triggers reconnect at this point > cifs_reconnect > - mid->mid_state updated to MID_RETRY_NEEDED > smb2_writev_callback mid_entry->callback() > - mid_state leads to wdata->result = -EAGAIN > wdata->result = -EAGAIN > queue_work(cifsiod_wq, &wdata->work); > cifs_writev_complete - worker function > - wdata->result == -EAGAIN Condition satisfied > cifs_writev_requeue > lock_page - [3] Blocked by [1] > end_page_writeback > - [4] Won't execute > unlock_page > ``` > > Mainline refactoring commit d08089f649a0 ("cifs: Change the I/O paths to use > an iterator rather than a page list") unlock folio while waiting for the > writeback to complete. This patch is introduced in v6.3-rc1. Therefore, scenario 1 > only affects LTS versions from linux-5.4 to linux-6.1. > > Call stack trace: > > ``` > cat /proc/34/stack > [<0>] __lock_page+0x147/0x3a0 > [<0>] cifs_writev_requeue.cold+0x185/0x28e > [<0>] process_one_work+0x1df/0x3b0 > [<0>] worker_thread+0x4a/0x3c0 > [<0>] kthread+0x125/0x160 > [<0>] ret_from_fork+0x22/0x30 > > # cat /proc/465/stack > [<0>] wait_on_page_bit+0x106/0x2e0 > [<0>] wait_on_page_writeback+0x25/0xd0 > [<0>] cifs_writepages+0x5ee/0xf60 > [<0>] do_writepages+0x43/0xe0 > [<0>] __filemap_fdatawrite_range+0xcd/0x110 > [<0>] file_write_and_wait_range+0x40/0x90 > [<0>] cifs_strict_fsync+0x35/0x470 > [<0>] do_fsync+0x38/0x70 > [<0>] __x64_sys_fsync+0x10/0x20 > [<0>] do_syscall_64+0x33/0x40 > [<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1 > > [ 369.826215] INFO: task kworker/1:1:34 blocked for more than 122 seconds. > [ 369.828964] Not tainted 5.10.0+ #164 > [ 369.830623] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 369.835104] task:kworker/1:1 state:D stack:13472 pid: 34 ppid: 2 flags:0x00004000 > [ 369.838448] Workqueue: cifsiod cifs_writev_complete > [ 369.840242] Call Trace: > [ 369.841219] __schedule+0x401/0x8e0 > [ 369.842568] schedule+0x49/0x130 > [ 369.843785] io_schedule+0x12/0x40 > [ 369.845079] __lock_page+0x147/0x3a0 > [ 369.846444] ? add_to_page_cache_lru+0x180/0x180 > [ 369.847963] cifs_writev_requeue.cold+0x185/0x28e > [ 369.849193] process_one_work+0x1df/0x3b0 > [ 369.850248] worker_thread+0x4a/0x3c0 > [ 369.851216] ? process_one_work+0x3b0/0x3b0 > [ 369.852308] kthread+0x125/0x160 > [ 369.853167] ? kthread_park+0x90/0x90 > [ 369.854142] ret_from_fork+0x22/0x30 > [ 369.855054] INFO: task kworker/u8:3:96 blocked for more than 122 seconds. > [ 369.856781] Not tainted 5.10.0+ #164 > [ 369.857851] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 369.859419] task:kworker/u8:3 state:D stack:12744 pid: 96 ppid: 2 flags:0x00004000 > [ 369.861041] Workqueue: writeback wb_workfn (flush-cifs-2) > [ 369.862095] Call Trace: > [ 369.862583] __schedule+0x401/0x8e0 > [ 369.863280] schedule+0x49/0x130 > [ 369.863912] io_schedule+0x12/0x40 > [ 369.864604] __lock_page+0x147/0x3a0 > [ 369.865322] ? add_to_page_cache_lru+0x180/0x180 > [ 369.866246] cifs_writepages+0x620/0xf60 > [ 369.867005] do_writepages+0x43/0xe0 > [ 369.867737] ? __blk_mq_try_issue_directly+0x121/0x1c0 > [ 369.868750] __writeback_single_inode+0x3d/0x320 > [ 369.869589] writeback_sb_inodes+0x20d/0x480 > [ 369.870367] __writeback_inodes_wb+0x4c/0xe0 > [ 369.871148] wb_writeback+0x201/0x2f0 > [ 369.871797] wb_workfn+0x38a/0x4e0 > [ 369.872427] ? check_preempt_curr+0x47/0x70 > [ 369.873191] ? ttwu_do_wakeup.isra.0+0x17/0x170 > [ 369.873999] process_one_work+0x1df/0x3b0 > [ 369.874741] worker_thread+0x4a/0x3c0 > [ 369.875421] ? process_one_work+0x3b0/0x3b0 > [ 369.876180] kthread+0x125/0x160 > [ 369.876761] ? kthread_park+0x90/0x90 > [ 369.877431] ret_from_fork+0x22/0x30 > [ 369.878106] INFO: task a.out:465 blocked for more than 122 seconds. > [ 369.879225] Not tainted 5.10.0+ #164 > [ 369.879945] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 369.881316] task:a.out state:D stack:12752 pid: 465 ppid: 386 flags:0x00000002 > [ 369.882791] Call Trace: > [ 369.883263] __schedule+0x401/0x8e0 > [ 369.883884] schedule+0x49/0x130 > [ 369.884447] io_schedule+0x12/0x40 > [ 369.885054] wait_on_page_bit+0x106/0x2e0 > [ 369.885795] ? add_to_page_cache_lru+0x180/0x180 > [ 369.886631] wait_on_page_writeback+0x25/0xd0 > [ 369.887427] cifs_writepages+0x5ee/0xf60 > [ 369.888151] do_writepages+0x43/0xe0 > [ 369.888789] ? __generic_file_write_iter+0xfd/0x1d0 > [ 369.889663] __filemap_fdatawrite_range+0xcd/0x110 > [ 369.890523] file_write_and_wait_range+0x40/0x90 > [ 369.891360] cifs_strict_fsync+0x35/0x470 > [ 369.892094] do_fsync+0x38/0x70 > [ 369.892657] __x64_sys_fsync+0x10/0x20 > [ 369.893336] do_syscall_64+0x33/0x40 > [ 369.893978] entry_SYSCALL_64_after_hwframe+0x67/0xd1 > [ 369.894883] RIP: 0033:0x7f660e208950 > [ 369.895538] RSP: 002b:00007fff52b27b78 EFLAGS: 00000202 ORIG_RAX: 000000000000004a > [ 369.896882] RAX: ffffffffffffffda RBX: 00007fff52b28cb8 RCX: 00007f660e208950 > [ 369.898139] RDX: 0000000000001000 RSI: 00007fff52b27b80 RDI: 0000000000000003 > [ 369.899395] RBP: 00007fff52b28ba0 R08: 0000000000000410 R09: 0000000000000001 > [ 369.900661] R10: 00007f660e11c400 R11: 0000000000000202 R12: 0000000000000000 > [ 369.901925] R13: 00007fff52b28cc8 R14: 00007f660e328000 R15: 000055b5aeb6fdd8 > [ 369.903202] INFO: task sync:468 blocked for more than 122 seconds. > [ 369.904311] Not tainted 5.10.0+ #164 > [ 369.905034] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 369.906457] task:sync state:D stack:13632 pid: 468 ppid: 386 flags:0x00004002 > [ 369.907930] Call Trace: > [ 369.908369] __schedule+0x401/0x8e0 > [ 369.908984] schedule+0x49/0x130 > [ 369.909582] io_schedule+0x12/0x40 > [ 369.910208] wait_on_page_bit+0x106/0x2e0 > [ 369.910918] ? add_to_page_cache_lru+0x180/0x180 > [ 369.911758] wait_on_page_writeback+0x25/0xd0 > [ 369.912560] __filemap_fdatawait_range+0x83/0x110 > [ 369.913408] ? __add_pages+0x6f/0x1b0 > [ 369.914089] filemap_fdatawait_keep_errors+0x1a/0x50 > [ 369.914957] sync_inodes_sb+0x208/0x2a0 > [ 369.915666] ? __x64_sys_tee+0xd0/0xd0 > [ 369.916344] iterate_supers+0x90/0xe0 > [ 369.916983] ksys_sync+0x40/0xb0 > [ 369.917590] __do_sys_sync+0xa/0x20 > [ 369.918240] do_syscall_64+0x33/0x40 > [ 369.918884] entry_SYSCALL_64_after_hwframe+0x67/0xd1 > [ 369.919800] RIP: 0033:0x7f746d820987 > [ 369.920451] RSP: 002b:00007ffce853fd78 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2 > [ 369.921798] RAX: ffffffffffffffda RBX: 00007ffce853fed8 RCX: 00007f746d820987 > [ 369.923063] RDX: 00007f746d8f4801 RSI: 00007ffce8541f71 RDI: 00007f746d8b05ad > [ 369.924339] RBP: 0000000000000001 R08: 000000000000ffff R09: 0000000000000000 > [ 369.925605] R10: 00007f746d7308a0 R11: 0000000000000206 R12: 000055b8487470fb > [ 369.926866] R13: 0000000000000000 R14: 0000000000000000 R15: 000055b848749ce0 > [ 369.928138] Kernel panic - not syncing: hung_task: blocked tasks > [ 369.929191] CPU: 3 PID: 35 Comm: khungtaskd Not tainted 5.10.0+ #164 > [ 369.952450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 > [ 369.956984] Call Trace: > [ 369.957973] dump_stack+0x57/0x6e > [ 369.959273] panic+0x115/0x2f1 > [ 369.960476] watchdog.cold+0xb5/0xb5 > [ 369.961884] ? hungtask_pm_notify+0x40/0x40 > [ 369.963310] kthread+0x125/0x160 > [ 369.964354] ? kthread_park+0x90/0x90 > [ 369.965551] ret_from_fork+0x22/0x30 > [ 369.967673] Kernel Offset: 0xd600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > [ 369.971025] ---[ end Kernel panic - not syncing: hung_task: blocked tasks ]--- > ``` > > Scenarios 2: > > Scenario 2 occurs in strict cache mode > > ``` > P1 (dd) P2 (cifsd) P3 (cifsiod) > > cifs_strict_writev > cifs_zap_mapping - If something breaks the oplock > cifs_revalidate_mapping > cifs_invalidate_mapping > invalidate_inode_pages2 > invalidate_inode_pages2_range > lock_page - [1] > wait_on_page_writeback - [2] Waiting for writeback, blocked by [4] > wait_on_page_bit > cifs_demultiplex_thread > cifs_read_from_socket > cifs_readv_from_socket > - If another process triggers reconnect at this point > cifs_reconnect > - mid->mid_state updated to MID_RETRY_NEEDED > smb2_writev_callback mid_entry->callback() > - mid_state leads to wdata->result = -EAGAIN > wdata->result = -EAGAIN > queue_work(cifsiod_wq, &wdata->work); > cifs_writev_complete - worker function > - wdata->result == -EAGAIN Condition satisfied > cifs_writev_requeue > lock_page - [3] Blocked by [1] > end_page_writeback > - [4] Won't execute > unlock_page > ``` > > Mainline refactoring commit 3ee1a1fc3981 ("cifs: Cut over to using netfslib") > directly terminates the file write instead of resending data when smb2_writev_callback() > detects a write failure, thus avoiding this problem. This patch is introduced > in v6.10-rc1. Therefore, scenario 2 affects LTS versions from linux-5.4 > to linux-6.6. > > ``` > cat /proc/522/stack > [<0>] wait_on_page_bit+0x106/0x150 > [<0>] invalidate_inode_pages2_range+0x2cc/0x580 > [<0>] cifs_invalidate_mapping+0x2c/0x50 [cifs] > [<0>] cifs_revalidate_mapping+0x4c/0x90 [cifs] > [<0>] cifs_strict_writev+0x17a/0x250 [cifs] > [<0>] __vfs_write+0x14f/0x1b0 > [<0>] vfs_write+0xb6/0x1a0 > [<0>] ksys_write+0x57/0xd0 > [<0>] do_syscall_64+0x63/0x250 > [<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > [<0>] 0xffffffffffffffff > > cat /proc/33/stack > [<0>] __lock_page+0x10c/0x160 > [<0>] cifs_writev_requeue.cold+0x17e/0x239 [cifs] > [<0>] process_one_work+0x1a9/0x3f0 > [<0>] worker_thread+0x50/0x3c0 > [<0>] kthread+0x117/0x130 > [<0>] ret_from_fork+0x35/0x40 > [<0>] 0xffffffffffffffff > ``` > > > The root cause of the deadlock problem is that the page/folio is locked again in > cifs_writev_requeue(). In order to safely fix it on the LTS branches, I would > like to clarify the following questions:, > > 1. Whether resending is necessary. If retransmission is not required, simply > terminating the write would avoids this problem. Is this an acceptable solution? > > 2. Is it necessary to lock the page/folio in cifs_writev_requeue()? Based on > my code screening (possibly missing), there seems to be no process that modifies > a page when it is marked as PG_writeback.Therefore, the page does not need to be > locked during wait_on_page_writeback(). > -- Thanks, Steve