On Fri, Nov 03, 2023 at 09:40:44AM +0000, Daire Byrne wrote:
> Hi,
>
> We have large compute clusters that, amongst other things, spend their
> day mounting & unmounting lots of Linux NFS servers via autofs. This
> has worked fine for many years and client kernel versions and was
> working without incident even with our current v6.3.x production
> kernels.
>
> During the v6.6-rc cycle while testing that kernel, I noticed that
> every now and then, the umount/mount would hang randomly and the
> compute host would get stuck and not complete its work until a
> reboot. I thought I'd wait until v6.6 was released and check again -
> the issue persists.
>
> I have not had the opportunity to test the v6.4 & v6.5 kernels in
> between yet. The stack traces look something like this:

Please bisect to find the exact commit that introduced your
regression. See Documentation/admin-guide/bug-bisect.rst in the kernel
sources for more information.

>
> [202752.264187] INFO: task umount.nfs:58118 blocked for more than 245 seconds.
> [202752.264237] Tainted: G E 6.6.0-1.dneg.x86_64 #1

Can you reproduce this on an untainted (vanilla) kernel?

> [202752.264267] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [202752.264296] task:umount.nfs state:D stack:0 pid:58118
> ppid:1 flags:0x00004002
> [202752.264304] Call Trace:
> [202752.264308] <TASK>
> [202752.264313] __schedule+0x30b/0xa10
> [202752.264327] schedule+0x68/0xf0
> [202752.264332] io_schedule+0x16/0x40
> [202752.264337] __folio_lock+0xfc/0x220
> [202752.264346] ? srso_alias_return_thunk+0x5/0x7f
> [202752.264353] ? __pfx_wake_page_function+0x10/0x10
> [202752.264361] truncate_inode_pages_range+0x441/0x460
> [202752.264411] truncate_inode_pages_final+0x41/0x50
> [202752.264425] nfs_evict_inode+0x1a/0x40 [nfs]
> [202752.264476] evict+0xdc/0x190
> [202752.264485] dispose_list+0x4d/0x70
> [202752.264491] evict_inodes+0x16b/0x1b0
> [202752.264499] generic_shutdown_super+0x3e/0x160
> [202752.264507] kill_anon_super+0x17/0x50
> [202752.264513] nfs_kill_super+0x27/0x50 [nfs]
> [202752.264556] deactivate_locked_super+0x35/0x90
> [202752.264562] deactivate_super+0x42/0x50
> [202752.264568] cleanup_mnt+0x109/0x170
> [202752.264574] __cleanup_mnt+0x12/0x20
> [202752.264580] task_work_run+0x61/0x90
> [202752.264588] exit_to_user_mode_prepare+0x1d8/0x200
> [202752.264596] syscall_exit_to_user_mode+0x1c/0x40
> [202752.264603] do_syscall_64+0x48/0x90
> [202752.264609] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [202752.264617] RIP: 0033:0x7fcb9befeba7
> [202752.264622] RSP: 002b:00007ffdd63ef348 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000a6
> [202752.264628] RAX: 0000000000000000 RBX: 00005561e35da010 RCX:
> 00007fcb9befeba7
> [202752.264632] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
> 00005561e35da1e0
> [202752.264634] RBP: 00005561e35da1e0 R08: 00005561e35dbfa0 R09:
> 00005561e35db790
> [202752.264637] R10: 00007ffdd63eeda0 R11: 0000000000000246 R12:
> 00007fcb9c442d78
> [202752.264640] R13: 0000000000000000 R14: 00005561e35db2c0 R15:
> 00007ffdd63f0dcb
> [202752.264648] </TASK>
>
> [202752.264658] INFO: task mount.nfs:60827 blocked for more than 122 seconds.
> [202752.264686] Tainted: G E 6.6.0-1.dneg.x86_64 #1
> [202752.264713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [202752.264743] task:mount.nfs state:D stack:0 pid:60827
> ppid:60826 flags:0x00004000
> [202752.264751] Call Trace:
> [202752.264753] <TASK>
> [202752.264757] __schedule+0x30b/0xa10
> [202752.264763] ? srso_alias_return_thunk+0x5/0x7f
> [202752.264771] schedule+0x68/0xf0
> [202752.264776] schedule_preempt_disabled+0x15/0x30
> [202752.264782] rwsem_down_write_slowpath+0x2b3/0x640
> [202752.264788] ? try_to_wake_up+0x242/0x5f0
> [202752.264797] ? __x86_indirect_jump_thunk_r15+0x20/0x20
> [202752.264803] ? wake_up_q+0x50/0x90
> [202752.264809] down_write+0x55/0x70
> [202752.264815] super_lock+0x44/0x130
> [202752.264821] ? kernfs_activate+0x54/0x60
> [202752.264828] ? srso_alias_return_thunk+0x5/0x7f
> [202752.264833] ? kernfs_add_one+0x11f/0x160
> [202752.264841] grab_super+0x2e/0x80
> [202752.264847] grab_super_dead+0x31/0xe0
> [202752.264855] ? srso_alias_return_thunk+0x5/0x7f
> [202752.264860] ? sysfs_create_link_nowarn+0x22/0x40
> [202752.264865] ? srso_alias_return_thunk+0x5/0x7f
> [202752.264871] ? __pfx_nfs_compare_super+0x10/0x10 [nfs]
> [202752.264915] sget_fc+0xd4/0x280
> [202752.264921] ? __pfx_nfs_set_super+0x10/0x10 [nfs]
> [202752.264965] nfs_get_tree_common+0x86/0x520 [nfs]
> [202752.265009] nfs_try_get_tree+0x5c/0x2e0 [nfs]
> [202752.265052] ? srso_alias_return_thunk+0x5/0x7f
> [202752.265058] ? try_module_get+0x1d/0x30
> [202752.265064] ? srso_alias_return_thunk+0x5/0x7f
> [202752.265068] ? get_nfs_version+0x29/0x90 [nfs]
> [202752.265111] ? srso_alias_return_thunk+0x5/0x7f
> [202752.265116] ? nfs_fs_context_validate+0x4fe/0x710 [nfs]
> [202752.265163] nfs_get_tree+0x38/0x60 [nfs]
> [202752.265202] vfs_get_tree+0x2a/0xe0
> [202752.265207] ? capable+0x19/0x20
> [202752.265213] path_mount+0x2fe/0xa90
> [202752.265219] ? putname+0x55/0x70
> [202752.265226] do_mount+0x80/0xa0
> [202752.265233] __x64_sys_mount+0x8b/0xe0
> [202752.265240] do_syscall_64+0x3b/0x90
> [202752.265245] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [202752.265250] RIP: 0033:0x7fbf5d4ff26a
> [202752.265253] RSP: 002b:00007ffeb24fdd98 EFLAGS: 00000202 ORIG_RAX:
> 00000000000000a5
> [202752.265258] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007fbf5d4ff26a
> [202752.265261] RDX: 000055c814e78100 RSI: 000055c814e771e0 RDI:
> 000055c814e77320
> [202752.265264] RBP: 00007ffeb24fdfb0 R08: 000055c814e85510 R09:
> 000000000000006d
> [202752.265266] R10: 0000000000000004 R11: 0000000000000202 R12:
> 00007fbf5e2307e0
> [202752.265269] R13: 00007ffeb24fdfb0 R14: 00007ffeb24fde90 R15:
> 000055c814e855a0
> [202752.265277] </TASK>
>
> And like I said, the mount/umount against the server hangs
> indefinitely on the client. It is somewhat interesting that autofs
> still tries to trigger a subsequent mount even though the umount has
> not completed.
>
> The NFS servers are running RHEL8.5 and we are using NFSv3. I also
> reproduced it with a fairly recent nfs-utils-2.6.2 on the client
> compute hosts.

Which distro are the clients running?

>
> Because these happen quite rarely, it takes time and many clients and
> mount/umount cycles to reproduce, so I thought I'd post here before
> working through the bisect testing. If you think this is better as a
> kernel.org bugzilla ticket, I'm happy to do that too.

For now, posting to the ML is preferred, as many developers don't
look at bugzilla.kernel.org.

Thanks.

-- 
An old man doll... just what I always wanted! - Clara