https://bugzilla.kernel.org/show_bug.cgi?id=214767 Bug ID: 214767 Summary: xfs seems to hang due to race condition? maybe related to (gratuitous) thaw. Product: File System Version: 2.5 Kernel Version: 5.10.70 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: high Priority: P1 Component: XFS Assignee: filesystem_xfs@xxxxxxxxxxxxxxxxxxxxxx Reporter: ct@xxxxxxxxxxxxxxx Regression: No We have been getting somewhat frequently (around 1 per day for about 500 VMs) blocked tasks with tracebacks always similar to this: First we get: [656898.010322] INFO: task kworker/u2:1:458736 blocked for more than 122 seconds. [656898.011988] Not tainted 5.10.70 #1-NixOS [656898.012981] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [656898.014679] task:kworker/u2:1 state:D stack: 0 pid:458736 ppid: 2 flags:0x00004000 [656898.016530] Workqueue: writeback wb_workfn (flush-253:0) [656898.017698] Call Trace: [656898.018310] __schedule+0x271/0x860 [656898.019208] schedule+0x46/0xb0 [656898.019944] rwsem_down_read_slowpath+0x169/0x490 [656898.021008] ? mempool_alloc+0x62/0x170 [656898.022032] xfs_map_blocks+0xb9/0x400 [xfs] [656898.022998] iomap_do_writepage+0x163/0x850 [656898.023587] ? __mod_memcg_lruvec_state+0x21/0xe0 [656898.024044] write_cache_pages+0x186/0x3d0 [656898.024463] ? iomap_migrate_page+0xc0/0xc0 [656898.024882] ? submit_bio_noacct+0x3a9/0x420 [656898.025303] iomap_writepages+0x1c/0x40 [656898.025712] xfs_vm_writepages+0x64/0x90 [xfs] [656898.026146] do_writepages+0x34/0xc0 [656898.026513] __writeback_single_inode+0x39/0x2a0 [656898.026969] writeback_sb_inodes+0x200/0x470 [656898.027388] __writeback_inodes_wb+0x4c/0xe0 [656898.027805] wb_writeback+0x1d8/0x290 [656898.028161] wb_workfn+0x29b/0x4d0 [656898.028526] ? __switch_to_asm+0x42/0x70 [656898.028915] ? __switch_to+0x7b/0x3e0 [656898.029285] process_one_work+0x1df/0x370 [656898.029696] worker_thread+0x50/0x400 [656898.030054] ? process_one_work+0x370/0x370 [656898.030475] kthread+0x11b/0x140 [656898.030798] ? __kthread_bind_mask+0x60/0x60 [656898.031214] ret_from_fork+0x22/0x30 [656898.031592] INFO: task nix-daemon:459204 blocked for more than 122 seconds. [656898.032259] Not tainted 5.10.70 #1-NixOS [656898.032695] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [656898.033451] task:nix-daemon state:D stack: 0 pid:459204 ppid: 1067 flags:0x00000000 [656898.034261] Call Trace: [656898.034538] __schedule+0x271/0x860 [656898.034881] schedule+0x46/0xb0 [656898.035226] xfs_log_commit_cil+0x6a4/0x800 [xfs] [656898.035709] ? wake_up_q+0xa0/0xa0 [656898.036073] __xfs_trans_commit+0x9d/0x310 [xfs] [656898.036561] xfs_setattr_nonsize+0x342/0x520 [xfs] [656898.037027] notify_change+0x348/0x4c0 [656898.037407] ? chmod_common+0xa1/0x150 [656898.037779] chmod_common+0xa1/0x150 [656898.038139] do_fchmodat+0x5a/0xb0 [656898.038491] __x64_sys_chmod+0x17/0x20 [656898.038860] do_syscall_64+0x33/0x40 [656898.039221] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [656898.039722] RIP: 0033:0x7f3d9b3a3b77 [656898.040072] RSP: 002b:00007ffc388e2b18 EFLAGS: 00000206 ORIG_RAX: 000000000000005a [656898.040797] RAX: ffffffffffffffda RBX: 00007ffc388e2e10 RCX: 00007f3d9b3a3b77 [656898.041542] RDX: 0000000000008049 RSI: 0000000000008124 RDI: 00000000016a2400 [656898.042222] RBP: 00007ffc388e32a0 R08: 00007ffc388e4628 R09: 0000000000000000 [656898.042913] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc388e4620 [656898.043604] R13: 0000000000000000 R14: 00007ffc388e2df0 R15: 00007ffc388e2e10 and then a while later we see this: [657020.886633] INFO: task qemu-ga:750 blocked for more than 122 seconds. [657020.888346] Not tainted 5.10.70 #1-NixOS [657020.889482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [657020.891072] task:qemu-ga state:D stack: 0 pid: 750 ppid: 1 flags:0x00000080 [657020.891906] Call Trace: [657020.892188] __schedule+0x271/0x860 [657020.892547] ? kvm_sched_clock_read+0xd/0x20 [657020.892983] schedule+0x46/0xb0 [657020.893319] rwsem_down_write_slowpath+0x218/0x480 [657020.893802] thaw_super+0x12/0x20 [657020.894130] __x64_sys_ioctl+0x62/0xb0 [657020.894491] do_syscall_64+0x33/0x40 [657020.894860] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [657020.895396] RIP: 0033:0x7f69a55f2b17 [657020.895762] RSP: 002b:00007ffe5687b548 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [657020.896470] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f69a55f2b17 [657020.897144] RDX: 0000000000000000 RSI: 00000000c0045878 RDI: 0000000000000006 [657020.897815] RBP: 0000000000000000 R08: 00007f69a5477c08 R09: 0000000000000000 [657020.898479] R10: 0000000000000000 R11: 0000000000000246 R12: 000055ef74a15190 [657020.899155] R13: 00007ffe5687b5c0 R14: 00007ffe5687b560 R15: 0000000000000006 [657020.899850] INFO: task kworker/u2:1:458736 blocked for more than 245 seconds. [657020.900513] Not tainted 5.10.70 #1-NixOS [657020.900942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [657020.901672] task:kworker/u2:1 state:D stack: 0 pid:458736 ppid: 2 flags:0x00004000 [657020.902469] Workqueue: writeback wb_workfn (flush-253:0) [657020.902976] Call Trace: [657020.903220] __schedule+0x271/0x860 [657020.903556] schedule+0x46/0xb0 [657020.903870] rwsem_down_read_slowpath+0x169/0x490 [657020.904321] ? mempool_alloc+0x62/0x170 [657020.904996] xfs_map_blocks+0xb9/0x400 [xfs] [657020.905412] iomap_do_writepage+0x163/0x850 [657020.905823] ? __mod_memcg_lruvec_state+0x21/0xe0 [657020.906269] write_cache_pages+0x186/0x3d0 [657020.906664] ? iomap_migrate_page+0xc0/0xc0 [657020.907079] ? submit_bio_noacct+0x3a9/0x420 [657020.907487] iomap_writepages+0x1c/0x40 [657020.907942] xfs_vm_writepages+0x64/0x90 [xfs] [657020.908376] do_writepages+0x34/0xc0 [657020.908760] __writeback_single_inode+0x39/0x2a0 [657020.909222] writeback_sb_inodes+0x200/0x470 [657020.909635] __writeback_inodes_wb+0x4c/0xe0 [657020.910058] wb_writeback+0x1d8/0x290 [657020.910410] wb_workfn+0x29b/0x4d0 [657020.910756] ? __switch_to_asm+0x42/0x70 [657020.911135] ? __switch_to+0x7b/0x3e0 [657020.911489] process_one_work+0x1df/0x370 [657020.911912] worker_thread+0x50/0x400 [657020.912264] ? process_one_work+0x370/0x370 [657020.912667] kthread+0x11b/0x140 [657020.913008] ? __kthread_bind_mask+0x60/0x60 [657020.913413] ret_from_fork+0x22/0x30 [657020.913791] INFO: task nix-daemon:459204 blocked for more than 245 seconds. [657020.914440] Not tainted 5.10.70 #1-NixOS [657020.914879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [657020.915612] task:nix-daemon state:D stack: 0 pid:459204 ppid: 1067 flags:0x00000000 [657020.916399] Call Trace: [657020.916654] __schedule+0x271/0x860 [657020.917006] schedule+0x46/0xb0 [657020.917343] xfs_log_commit_cil+0x6a4/0x800 [xfs] [657020.917806] ? wake_up_q+0xa0/0xa0 [657020.918162] __xfs_trans_commit+0x9d/0x310 [xfs] [657020.918633] xfs_setattr_nonsize+0x342/0x520 [xfs] [657020.919101] notify_change+0x348/0x4c0 [657020.919461] ? chmod_common+0xa1/0x150 [657020.919831] chmod_common+0xa1/0x150 [657020.920180] do_fchmodat+0x5a/0xb0 [657020.920509] __x64_sys_chmod+0x17/0x20 [657020.920878] do_syscall_64+0x33/0x40 [657020.921222] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [657020.921703] RIP: 0033:0x7f3d9b3a3b77 [657020.922055] RSP: 002b:00007ffc388e2b18 EFLAGS: 00000206 ORIG_RAX: 000000000000005a [657020.922772] RAX: ffffffffffffffda RBX: 00007ffc388e2e10 RCX: 00007f3d9b3a3b77 [657020.923435] RDX: 0000000000008049 RSI: 0000000000008124 RDI: 00000000016a2400 [657020.924115] RBP: 00007ffc388e32a0 R08: 00007ffc388e4628 R09: 0000000000000000 [657020.924787] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc388e4620 [657020.925450] R13: 0000000000000000 R14: 00007ffc388e2df0 R15: 00007ffc388e2e10 IO is subsequently stalled on ALL disks (we have vda/vdb/vdc with two of them running XFS and one running swap). I am pretty sure that this is not primarily an issue in the hypervisor (Qemu 4.1) or storage (Ceph Jewel) because I can warm reset the virtual machine using "system-reset" and the machine will boot properly. This could be related to https://bugzilla.kernel.org/show_bug.cgi?id=207053 but it doesn't look quite right. -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug.