Re: autofs mount/umount hangs with recent kernel?

Bagas Sanjaya <bagasdotme@xxxxxxxxx> · Fri, 3 Nov 2023 21:10:35 +0700

On Fri, Nov 03, 2023 at 09:40:44AM +0000, Daire Byrne wrote:
> Hi,
> 
> We have large compute clusters that, amongst other things, spend their
> day mounting & unounting lots of Linux NFS servers via autofs. This
> has worked fine for many years and client kernel versions and was
> working without incident even with our current v6.3.x production
> kernels.
> 
> During the v6.6-rc cycle while testing that kernel, I noticed that
> every now and then, the umount/mount would hang randomly and the
> compute host would get stuck and not complete it's work until a
> reboot. I thought I'd wait until v6.6 was released and check again -
> the issue persists.
> 
> I have not had the opportunity to test the v6.4 & v6.5 kernels in
> between yet. The stack traces look something like this:

Please do bisection to find the exact commit that introduces your
regression. See Documentation/admin-guide/bug-bisect.rst in the kernel
sources for more information.

> 
> [202752.264187] INFO: task umount.nfs:58118 blocked for more than 245 seconds.
> [202752.264237]       Tainted: G            E      6.6.0-1.dneg.x86_64 #1

Can you reproduce on untainted (vanilla) kernel?

> [202752.264267] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [202752.264296] task:umount.nfs      state:D stack:0     pid:58118
> ppid:1      flags:0x00004002
> [202752.264304] Call Trace:
> [202752.264308]  <TASK>
> [202752.264313]  __schedule+0x30b/0xa10
> [202752.264327]  schedule+0x68/0xf0
> [202752.264332]  io_schedule+0x16/0x40
> [202752.264337]  __folio_lock+0xfc/0x220
> [202752.264346]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.264353]  ? __pfx_wake_page_function+0x10/0x10
> [202752.264361]  truncate_inode_pages_range+0x441/0x460
> [202752.264411]  truncate_inode_pages_final+0x41/0x50
> [202752.264425]  nfs_evict_inode+0x1a/0x40 [nfs]
> [202752.264476]  evict+0xdc/0x190
> [202752.264485]  dispose_list+0x4d/0x70
> [202752.264491]  evict_inodes+0x16b/0x1b0
> [202752.264499]  generic_shutdown_super+0x3e/0x160
> [202752.264507]  kill_anon_super+0x17/0x50
> [202752.264513]  nfs_kill_super+0x27/0x50 [nfs]
> [202752.264556]  deactivate_locked_super+0x35/0x90
> [202752.264562]  deactivate_super+0x42/0x50
> [202752.264568]  cleanup_mnt+0x109/0x170
> [202752.264574]  __cleanup_mnt+0x12/0x20
> [202752.264580]  task_work_run+0x61/0x90
> [202752.264588]  exit_to_user_mode_prepare+0x1d8/0x200
> [202752.264596]  syscall_exit_to_user_mode+0x1c/0x40
> [202752.264603]  do_syscall_64+0x48/0x90
> [202752.264609]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [202752.264617] RIP: 0033:0x7fcb9befeba7
> [202752.264622] RSP: 002b:00007ffdd63ef348 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000a6
> [202752.264628] RAX: 0000000000000000 RBX: 00005561e35da010 RCX:
> 00007fcb9befeba7
> [202752.264632] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
> 00005561e35da1e0
> [202752.264634] RBP: 00005561e35da1e0 R08: 00005561e35dbfa0 R09:
> 00005561e35db790
> [202752.264637] R10: 00007ffdd63eeda0 R11: 0000000000000246 R12:
> 00007fcb9c442d78
> [202752.264640] R13: 0000000000000000 R14: 00005561e35db2c0 R15:
> 00007ffdd63f0dcb
> [202752.264648]  </TASK>
> 
> [202752.264658] INFO: task mount.nfs:60827 blocked for more than 122 seconds.
> [202752.264686]       Tainted: G            E      6.6.0-1.dneg.x86_64 #1
> [202752.264713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [202752.264743] task:mount.nfs       state:D stack:0     pid:60827
> ppid:60826  flags:0x00004000
> [202752.264751] Call Trace:
> [202752.264753]  <TASK>
> [202752.264757]  __schedule+0x30b/0xa10
> [202752.264763]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.264771]  schedule+0x68/0xf0
> [202752.264776]  schedule_preempt_disabled+0x15/0x30
> [202752.264782]  rwsem_down_write_slowpath+0x2b3/0x640
> [202752.264788]  ? try_to_wake_up+0x242/0x5f0
> [202752.264797]  ? __x86_indirect_jump_thunk_r15+0x20/0x20
> [202752.264803]  ? wake_up_q+0x50/0x90
> [202752.264809]  down_write+0x55/0x70
> [202752.264815]  super_lock+0x44/0x130
> [202752.264821]  ? kernfs_activate+0x54/0x60
> [202752.264828]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.264833]  ? kernfs_add_one+0x11f/0x160
> [202752.264841]  grab_super+0x2e/0x80
> [202752.264847]  grab_super_dead+0x31/0xe0
> [202752.264855]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.264860]  ? sysfs_create_link_nowarn+0x22/0x40
> [202752.264865]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.264871]  ? __pfx_nfs_compare_super+0x10/0x10 [nfs]
> [202752.264915]  sget_fc+0xd4/0x280
> [202752.264921]  ? __pfx_nfs_set_super+0x10/0x10 [nfs]
> [202752.264965]  nfs_get_tree_common+0x86/0x520 [nfs]
> [202752.265009]  nfs_try_get_tree+0x5c/0x2e0 [nfs]
> [202752.265052]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.265058]  ? try_module_get+0x1d/0x30
> [202752.265064]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.265068]  ? get_nfs_version+0x29/0x90 [nfs]
> [202752.265111]  ? srso_alias_return_thunk+0x5/0x7f
> [202752.265116]  ? nfs_fs_context_validate+0x4fe/0x710 [nfs]
> [202752.265163]  nfs_get_tree+0x38/0x60 [nfs]
> [202752.265202]  vfs_get_tree+0x2a/0xe0
> [202752.265207]  ? capable+0x19/0x20
> [202752.265213]  path_mount+0x2fe/0xa90
> [202752.265219]  ? putname+0x55/0x70
> [202752.265226]  do_mount+0x80/0xa0
> [202752.265233]  __x64_sys_mount+0x8b/0xe0
> [202752.265240]  do_syscall_64+0x3b/0x90
> [202752.265245]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [202752.265250] RIP: 0033:0x7fbf5d4ff26a
> [202752.265253] RSP: 002b:00007ffeb24fdd98 EFLAGS: 00000202 ORIG_RAX:
> 00000000000000a5
> [202752.265258] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007fbf5d4ff26a
> [202752.265261] RDX: 000055c814e78100 RSI: 000055c814e771e0 RDI:
> 000055c814e77320
> [202752.265264] RBP: 00007ffeb24fdfb0 R08: 000055c814e85510 R09:
> 000000000000006d
> [202752.265266] R10: 0000000000000004 R11: 0000000000000202 R12:
> 00007fbf5e2307e0
> [202752.265269] R13: 00007ffeb24fdfb0 R14: 00007ffeb24fde90 R15:
> 000055c814e855a0
> [202752.265277]  </TASK>
> 
> And like I said, the mount/umount against the server hangs
> indefinitely on the client. It is somewhat interesting that autofs
> still tries to trigger a subsequent mount even though the umount has
> not completed.
> 
> The NFS servers are running RHEL8.5 and we are using NFSv3. I also
> reproduced it with a fairly recent nfs-utils-2.6.2 on the client
> compute hosts.

What distro on the client?

> 
> Because these happen quite rarely, it takes time and many clients and
> mount/umount cycles to reproduce, so I thought I'd post here before
> working through the bisect testing. If you think this is better as a
> kernel.org bugzilla ticket, I'm happy to do that too.

For now, posting to the ML is preferred as many developers don't take
a look on bugzilla.kernel.org.

Thanks.

-- 
An old man doll... just what I always wanted! - Clara
Attachment:
signature.asc

Description: PGP signature