On Mon, 2023-11-06 at 12:11 -0800, dai.ngo@xxxxxxxxxx wrote: > > On 11/6/23 10:44 AM, Trond Myklebust wrote: > > On Mon, 2023-11-06 at 10:18 -0800, dai.ngo@xxxxxxxxxx wrote: > > > On 11/6/23 8:30 AM, Trond Myklebust wrote: > > > > Hi Dai, > > > > > > > > On Sun, 2023-11-05 at 11:27 -0800, dai.ngo@xxxxxxxxxx wrote: > > > > > Hi Trond, > > > > > > > > > > When unmonting a NFS export, nfs_free_server is called. In > > > > > nfs_free_server, > > > > > we call rpc_shutdown_client(server->client) to kill all > > > > > pending > > > > > RPC > > > > > tasks > > > > > and wait for them to terminate before continue on to call > > > > > nfs_put_client. > > > > > In nfs_put_client, if the refcounf is drecemented to 0 then > > > > > we > > > > > call > > > > > nfs_free_client which calls rpc_shutdown_client(clp- > > > > > > cl_rpcclient) to > > > > > kill all pending RPC tasks that use nfs_client->cl_rpcclient > > > > > to > > > > > send > > > > > the > > > > > request. > > > > > > > > > > Normally this works fine. However, due to some race > > > > > conditions, > > > > > if > > > > > there are > > > > > delegation return RPC tasks have not been executed yet when > > > > > nfs_free_server > > > > > is called then this can cause system to crash with general > > > > > protection > > > > > fault. > > > > > > > > > > The conditions that can cause the crash are: (1) there are > > > > > pending > > > > > delegation > > > > > return tasks called from nfs4_state_manager to return idle > > > > > delegations and > > > > > (2) the nfs_client's au_flavor is either RPC_AUTH_GSS_KRB5I > > > > > or > > > > > RPC_AUTH_GSS_KRB5P > > > > > and (3) the call to nfs_igrab_and_active, from > > > > > _nfs4_proc_delegreturn, fails > > > > > for any reasons and (4) there is a pending RPC task renewing > > > > > the > > > > > lease. > > > > > > > > > > Since the delegation return tasks were called with 'issync = > > > > > 0' > > > > > the > > > > > refcount on > > > > > nfs_server were dropped (in > > > > > nfs_client_return_marked_delegations > > > > > after RPC task > > > > > was submited to the RPC layer) and the nfs_igrab_and_active > > > > > call > > > > > fails, these > > > > > RPC tasks do not hold any refcount on the nfs_server. > > > > > > > > > > When nfs_free_server is called, rpc_shutdown_client(server- > > > > > > client) > > > > > fails to > > > > > kill these delegation return tasks since these tasks using > > > > > nfs_client->cl_rpcclient > > > > > to send the requests. When nfs_put_client is called, > > > > > nfs_free_client > > > > > is not > > > > > called because there is a pending lease renew RPC task which > > > > > uses > > > > > nfs_client->cl_rpcclient > > > > > to send the request and also adds a refcount on the > > > > > nfs_client. > > > > > This > > > > > allows > > > > > the delegation return tasks to stay alive and continue on > > > > > after > > > > > the > > > > > nfs_server > > > > > was freed. > > > > > > > > > > I've seen the NFS client with 5.4 kernel crashes with this > > > > > stack > > > > > trace: > > > > > > > > > > !# 0 [ffffb93b8fbdbd78] nfs4_setup_sequence [nfsv4] at > > > > > ffffffffc0f27e40 fs/nfs/nfs4proc.c:1041:0 > > > > > # 1 [ffffb93b8fbdbdb8] nfs4_delegreturn_prepare [nfsv4] > > > > > at > > > > > ffffffffc0f28ad1 fs/nfs/nfs4proc.c:6355:0 > > > > > # 2 [ffffb93b8fbdbdd8] rpc_prepare_task [sunrpc] at > > > > > ffffffffc05e33af net/sunrpc/sched.c:821:0 > > > > > # 3 [ffffb93b8fbdbde8] __rpc_execute [sunrpc] at > > > > > ffffffffc05eb527 > > > > > net/sunrpc/sched.c:925:0 > > > > > # 4 [ffffb93b8fbdbe48] rpc_async_schedule [sunrpc] at > > > > > ffffffffc05eb8e0 net/sunrpc/sched.c:1013:0 > > > > > # 5 [ffffb93b8fbdbe68] process_one_work at > > > > > ffffffff92ad4289 > > > > > kernel/workqueue.c:2281:0 > > > > > # 6 [ffffb93b8fbdbeb0] worker_thread at ffffffff92ad50cf > > > > > kernel/workqueue.c:2427:0 > > > > > # 7 [ffffb93b8fbdbf10] kthread at ffffffff92adac05 > > > > > kernel/kthread.c:296:0 > > > > > # 8 [ffffb93b8fbdbf58] ret_from_fork at ffffffff93600364 > > > > > arch/x86/entry/entry_64.S:355:0 > > > > > > > > > > Where the params of nfs4_setup_sequence: > > > > > client = (struct nfs_client *)0x4d54158ebc6cfc01 > > > > > args = (struct nfs4_sequence_args > > > > > *)0xffff998487f85800 > > > > > res = (struct nfs4_sequence_res *)0xffff998487f85830 > > > > > task = (struct rpc_task *)0xffff997d41da7d00 > > > > > > > > > > The 'client' pointer is invalid since it was extracted from > > > > > d_data- > > > > > > res.server->nfs_client > > > > > and the nfs_server was freed. > > > > > > > > > > I've reviewed the latest kernel 6.6-rc7, even though there > > > > > are > > > > > many > > > > > changes > > > > > since 5.4 I could not see any any changes to prevent this > > > > > scenario to > > > > > happen > > > > > so I believe this problem still exists in 6.6-rc7. > > > > > > > > > > I'd like to get your opinion on this potential issue with the > > > > > latest > > > > > kernel > > > > > and if the problem still exists then what's the best way to > > > > > fix > > > > > it. > > > > > > > > > nfs_inode_evict_delegation() should be calling > > > > nfs_do_return_delegation() with the issync flag set, whereas > > > > nfs_server_return_marked_delegations() should be holding a > > > > reference to > > > > the inode before it calls nfs_end_delegation_return(). > > > > > > > > So where is this combination no inode reference + issync=0 > > > > originating > > > > from? > > > The inode reference was dropped in > > > nfs_server_return_marked_delegations > > > after returning from nfs_end_delegation_return. > > > > > > I don't quite understand why do we continue on with the > > > delegation > > > return > > > in _nfs4_proc_delegreturn after nfs_igrab_and_active fails and > > > issync > > > = 0? > > I'm not seeing how that case can occur at all. > > Since this crash is almost reproducible with 5.4 under a test setup, > I can add a patch to crash the system when nfs_igrab_and_active fails > and issync = 0. > > > > > If nfs_server_return_marked_delegations() is holding a reference to > > the > > superblock and to the inode across the call to > > nfs_end_delegation_return(), then it should not be possible for the > > call to nfs_igrab_and_active() to fail > > Yes, this is where I'm struggle to find out how can > nfs_igrab_and_active > fails while we already had a refcount on the nfs_server. There must > be > a scenario that causes nfs_igrab_and_active to fail since the > nfs4_delegreturndata->inode is NULL in the core dump. > > Is it possible that the igrab in nfs_igrab_and_active fails, the > inode > was marked with I_FREEING | I_WILL_FREE? > The inode should never be marked with either I_FREEING or I_WILL_FREE until the call to iput_final(), which isn't called until the inode i_count is zero. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx