On 12/4/23 8:15 AM, Chuck Lever wrote:
On Mon, Dec 04, 2023 at 04:34:00PM +0100, Wolfgang Walter wrote:
Hello,
after upgrading from stable 6.1.63 to stable 6.6.3, our NFS server logged a
WARNING and then more and more clients hung:
Dec 04 14:59:25 engel kernel: ------------[ cut here ]------------
Dec 04 14:59:25 engel kernel: WARNING: CPU: 17 PID: 8431 at
fs/nfsd/nfs4state.c:4919 nfsd_break_deleg_cb+0x174/0x190 [nfsd]
Dec 04 14:59:25 engel kernel: Modules linked in: cts rpcsec_gss_krb5 msr
8021q garp stp mrp llc binfmt_misc intel_rapl_msr intel_rapl_common sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp kv>
Dec 04 14:59:25 engel kernel: enclosure sd_mod usbhid t10_pi hid
crc64_rocksoft crc64 crc_t10dif crct10dif_generic ixgbe ahci xfrm_algo
xhci_pci libahci dca mdio_devres mpt3sas ehci_pci crct10dif_p>
Dec 04 14:59:25 engel kernel: CPU: 17 PID: 8431 Comm: nfsd Not tainted
6.6.3-debian64.all+1.2 #1
Dec 04 14:59:25 engel kernel: Hardware name: Supermicro X10DRi/X10DRI-T,
BIOS 1.1a 10/16/2015
Dec 04 14:59:25 engel kernel: RIP: 0010:nfsd_break_deleg_cb+0x174/0x190
[nfsd]
Dec 04 14:59:25 engel kernel: Code: 02 8c a4 c2 e9 ff fe ff ff 48 89 df be
01 00 00 00 e8 70 7c ed c2 48 8d bb 98 00 00 00 e8 b4 0e 01 00 84 c0 0f 85
2e ff ff ff <0f> 0b e9 27 ff ff ff be 02 00 00 0>
Dec 04 14:59:25 engel kernel: RSP: 0018:ffffbd57227c7a98 EFLAGS: 00010246
Dec 04 14:59:25 engel kernel: RAX: 0000000000000000 RBX: ffff94a77356e200
RCX: 0000000000000000
Dec 04 14:59:25 engel kernel: RDX: ffff94a77356e2c8 RSI: ffff94b78cf58000
RDI: 0000000000002000
Dec 04 14:59:25 engel kernel: RBP: ffff94a0392b3a34 R08: ffffbd57227c7a80
R09: 0000000000000000
Dec 04 14:59:25 engel kernel: R10: ffff94a05f4a9440 R11: 0000000000000000
R12: ffff94b8e3995b00
Dec 04 14:59:25 engel kernel: R13: ffff94a0392b3a20 R14: ffff94b8e3995b00
R15: 000000010eb733cd
Dec 04 14:59:25 engel kernel: FS: 0000000000000000(0000)
GS:ffff94b71fcc0000(0000) knlGS:0000000000000000
Dec 04 14:59:25 engel kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Dec 04 14:59:25 engel kernel: CR2: 00007f9ef8554000 CR3: 000000295e020003
CR4: 00000000001706e0
Dec 04 14:59:25 engel kernel: Call Trace:
Dec 04 14:59:25 engel kernel: <TASK>
Dec 04 14:59:25 engel kernel: ? nfsd_break_deleg_cb+0x174/0x190 [nfsd]
Dec 04 14:59:25 engel kernel: ? __warn+0x81/0x130
Dec 04 14:59:25 engel kernel: ? nfsd_break_deleg_cb+0x174/0x190 [nfsd]
Dec 04 14:59:25 engel kernel: ? report_bug+0x171/0x1a0
Dec 04 14:59:25 engel kernel: ? handle_bug+0x3c/0x80
Dec 04 14:59:25 engel kernel: ? exc_invalid_op+0x17/0x70
Dec 04 14:59:25 engel kernel: ? asm_exc_invalid_op+0x1a/0x20
Dec 04 14:59:25 engel kernel: ? nfsd_break_deleg_cb+0x174/0x190 [nfsd]
Dec 04 14:59:25 engel kernel: ? nfsd_break_deleg_cb+0x9a/0x190 [nfsd]
Dec 04 14:59:25 engel kernel: __break_lease+0x25c/0x720
Dec 04 14:59:25 engel kernel: __nfsd_open.isra.0+0xa9/0x1a0 [nfsd]
Dec 04 14:59:25 engel kernel: nfsd_file_do_acquire+0x4ca/0xc50 [nfsd]
Dec 04 14:59:25 engel kernel: nfs4_get_vfs_file+0x34c/0x3b0 [nfsd]
Dec 04 14:59:25 engel kernel: nfsd4_process_open2+0x42c/0x15d0 [nfsd]
Dec 04 14:59:25 engel kernel: ? nfsd_permission+0x63/0x100 [nfsd]
Dec 04 14:59:25 engel kernel: ? fh_verify+0x42e/0x720 [nfsd]
Dec 04 14:59:25 engel kernel: nfsd4_open+0x64a/0xc40 [nfsd]
Dec 04 14:59:25 engel kernel: ? nfsd4_encode_operation+0xa7/0x2b0 [nfsd]
Dec 04 14:59:25 engel kernel: nfsd4_proc_compound+0x351/0x670 [nfsd]
Dec 04 14:59:25 engel kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
Dec 04 14:59:25 engel kernel: nfsd_dispatch+0x7c/0x1b0 [nfsd]
Dec 04 14:59:25 engel kernel: svc_process_common+0x431/0x6e0 [sunrpc]
Dec 04 14:59:25 engel kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Dec 04 14:59:25 engel kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
Dec 04 14:59:25 engel kernel: svc_process+0x131/0x180 [sunrpc]
Dec 04 14:59:25 engel kernel: nfsd+0x84/0xd0 [nfsd]
Dec 04 14:59:25 engel kernel: kthread+0xe5/0x120
Dec 04 14:59:25 engel kernel: ? __pfx_kthread+0x10/0x10
Dec 04 14:59:25 engel kernel: ret_from_fork+0x31/0x50
Dec 04 14:59:25 engel kernel: ? __pfx_kthread+0x10/0x10
Dec 04 14:59:25 engel kernel: ret_from_fork_asm+0x1b/0x30
Dec 04 14:59:25 engel kernel: </TASK>
Dec 04 14:59:25 engel kernel: ---[ end trace 0000000000000000 ]---
6.1 did not show this problem.
Both are vanilla stable kernels (self-built).
Thank you for your report.
If you are able to bisect your server between v6.1 and v6.6, that
would help us narrow down the cause.
Dai, can you have a look at this?
The warning message indicates that the callback work was not queued because
it was already queued. In this case we will have taken an extra reference
to the stid that will never be put (see b95239ca4954a0). We should fix
this, but I don't think this extra reference is what causes the clients to hang.
It's hard to say what the root cause is without a core dump, a network
trace, or a way to reproduce the problem. As Chuck mentioned, bisecting
the server is the best way to help us narrow down the cause.
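To make the extra-reference scenario concrete, here is a toy model in Python. It is only a sketch and not kernel code: `Delegation`, `Workqueue`, and `break_deleg` are hypothetical stand-ins, and the only behaviour assumed from the description above is that queueing an already-queued work item returns false while the caller has already taken a reference.

```python
class Delegation:
    """Toy stand-in for a delegation's stid; only the refcount is modelled."""

    def __init__(self):
        self.refcount = 1    # reference held by the delegation's owner
        self.warned = False  # set when the simulated WARN condition is hit


class Workqueue:
    """Toy work queue: an item that is already pending is not queued again."""

    def __init__(self):
        self._pending = []

    def queue_work(self, item):
        if item in self._pending:
            return False     # already queued, nothing is added
        self._pending.append(item)
        return True

    def run_pending(self):
        pending, self._pending = self._pending, []
        for item in pending:
            item.refcount -= 1  # each callback that runs puts one reference


def break_deleg(wq, deleg):
    """Take a reference for the callback, then try to queue it."""
    deleg.refcount += 1
    queued = wq.queue_work(deleg)
    if not queued:
        deleg.warned = True  # the reference taken above is never put
    return queued


wq = Workqueue()
deleg = Delegation()
first = break_deleg(wq, deleg)   # work is queued
second = break_deleg(wq, deleg)  # work is already queued
wq.run_pending()                 # the single queued callback runs once
leaked = deleg.refcount - 1      # references beyond the owner's own
```

In this model the second `break_deleg` call takes a reference that no callback ever drops, so `leaked` ends up as 1. That matches the description of a leaked reference, and by itself it does not explain why clients would hang.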
Wolfgang, could you provide some additional information: how often the
problem happens, whether it occurs under load, whether it is reproducible,
how many clients are involved, what type of NFS activity they generate, etc.?
Thanks,
-Dai