On 1/16/25 4:03 AM, Rik Theys wrote:
The laundromat failure mode is not blocked in rpc_shutdown_client, so
there aren't any outstanding callback RPCs to observe.
The DESTROY_SESSION failure mode is blocking on the flush_workqueue call
in nfsd4_shutdown_callback(), while this failure mode appears to have
passed that call and blocked on the wait for in-flight RPCs to go to
zero (as Jeff analyzed a few days ago).
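The second failure mode is a wait for in-flight RPCs to drop to zero that never returns. It can be modeled in user space with a short Python sketch. The sketch assumes that in-flight callback RPCs are tracked by a per-client counter and that the shutdown path waits for that counter to reach zero. The class and method names are hypothetical, and this is not the kernel's code. A timeout stands in for the kernel's indefinite wait, so a stuck RPC shows up as a False return instead of a hang.

```python
import threading


class InflightTracker:
    """Hypothetical model of a per-client count of in-flight callback RPCs."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def start_rpc(self):
        # An RPC has been issued and is now outstanding.
        with self._cond:
            self._count += 1

    def finish_rpc(self):
        # An RPC completed; wake any waiter once nothing is outstanding.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_for_zero(self, timeout=None):
        # Shutdown-path wait: True once no RPCs are in flight,
        # False if the timeout expired with RPCs still outstanding.
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


t = InflightTracker()
t.start_rpc()
t.start_rpc()
t.finish_rpc()
print(t.wait_for_zero(timeout=0.1))  # False: one RPC never completed
t.finish_rpc()
print(t.wait_for_zero(timeout=0.1))  # True
```

If any start_rpc() call is never matched by finish_rpc(), wait_for_zero() called without a timeout blocks forever. That corresponds to the hang described above.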
If I look at the trace, nfs4_laundromat calls
nfs4_process_client_reaplist, which at some point calls
__destroy_client.
When I look at the __destroy_client function in nfs4state.c, I see it
takes spin_lock(&state_lock) around certain actions and then calls
spin_unlock(&state_lock), but the lock does not appear to be acquired
again before nfsd4_shutdown_callback() is called. The comment above
nfsd4_shutdown_callback() in nfs4callback.c says the function must be
called under the state lock. Is it possible that the function is called
without this state lock? Or is the comment no longer relevant?
The comment is stale.
Commit b687f6863eed ("nfsd: remove the client_mutex and the
nfs4_lock/unlock_state wrappers") removed the mutex that used to wrap
calls to this function.
Another thing I've noticed (but I'm not sure it's relevant here) is
that there's a client in /proc/fs/nfsd/clients whose states file
crashes nfsdclnts, because its entry does not have a "superblock"
field:
# cat 8536/{info,states}
clientid: 0x6d0596d0675df2b3
address: "10.87.29.32:864"
status: courtesy
seconds from last renew: 2807740
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed Dec 11 16:33:48 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UNKNOWN
callback address: 10.87.29.32:0
admin-revoked states: 0
- 0x00000001b3f25d67d096056d19facf00: { type: deleg, access: w }
This is one of the clients that has multiple entries in the
/proc/fs/nfsd/clients directory. Of all the clients that have
duplicate entries, this is the only one where the "broken" client has
been in the "courtesy" state for a long time. It's also the only
"broken" client that still has an entry in its states file. The others
are all in the "unconfirmed" state, and their states files are empty.
That client entry is likely pinned somehow by this bug.
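The nfsdclnts crash above comes from a states entry that lacks a "superblock" field. A tool reading these files can treat every per-state field as optional. The following is a minimal Python sketch of such a tolerant parser and does not reproduce nfsdclnts' actual code. It assumes the "- <stateid>: { key: value, ... }" line shape shown in the states output above, with values that are either bare tokens or double-quoted strings. The function name parse_state_line is hypothetical.

```python
import re

# Hypothetical parser for one line of a /proc/fs/nfsd/clients/<id>/states
# file, assuming the "- <stateid>: { key: value, ... }" shape shown above.
_LINE_RE = re.compile(r'^-\s+(0x[0-9a-fA-F]+):\s*\{(.*)\}\s*$')
# A key, a colon, then either a double-quoted string or a bare token up to ','.
_FIELD_RE = re.compile(r'([A-Za-z_-]+):\s*("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_state_line(line):
    """Return (stateid, fields) or None; every field is optional."""
    m = _LINE_RE.match(line.strip())
    if m is None:
        return None
    stateid, body = m.groups()
    fields = {}
    for key, value in _FIELD_RE.findall(body):
        value = value.strip()
        # Drop surrounding quotes from quoted values such as superblock.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return stateid, fields


sid, f = parse_state_line('- 0x00000001b3f25d67d096056d19facf00: { type: deleg, access: w }')
print(f.get("superblock", "(none)"))  # (none)
```

Using fields.get() with a default, instead of indexing fields["superblock"] directly, lets a report skip or mark entries such as the delegation above, rather than aborting on them.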
--
Chuck Lever