Re: nfsd stuck in D (disk sleep) state

Jeff Layton <jlayton@xxxxxxxxxx> · Mon, 28 Oct 2024 13:27:11 -0400

On Mon, 2024-10-28 at 08:46 -0400, Tom Talpey wrote:
> On 10/28/2024 5:18 AM, Benoît Gschwind wrote:
> > Hello,
> > 
> > The issue trigger again, I attached the result of:
> > 
> > # dmesg -W | tee dmesg.txt
> > 
> > using:
> > 
> > # echo t > /proc/sysrq-trigger
> > 
> > I have the following PID stuck:
> > 
> >      1474 D (disk sleep)       0:54:58.602 [nfsd]
> >      1475 D (disk sleep)       0:54:58.602 [nfsd]
> >      1484 D (disk sleep)       0:54:58.602 [nfsd]
> >      1495 D (disk sleep)       0:54:58.602 [nfsd]
> 
> Hmm, 1495 is stuck in nfsd4_create_session
> 
>  > [427468.304955] task:nfsd            state:D stack:0     pid:1495 
> ppid:2      flags:0x00004000
>  > [427468.304962] Call Trace:
>  > [427468.304965]  <TASK>
>  > [427468.304971]  __schedule+0x34d/0x9e0
>  > [427468.304983]  schedule+0x5a/0xd0
>  > [427468.304991]  schedule_timeout+0x118/0x150
>  > [427468.305003]  wait_for_completion+0x86/0x160
>  > [427468.305015]  __flush_workqueue+0x152/0x420
>  > [427468.305031]  nfsd4_create_session+0x79f/0xba0 [nfsd]
>  > [427468.305092]  nfsd4_proc_compound+0x34c/0x660 [nfsd]
>  > [427468.305147]  nfsd_dispatch+0x1a1/0x2b0 [nfsd]
>  > [427468.305199]  svc_process_common+0x295/0x610 [sunrpc]
>  > [427468.305269]  ? svc_recv+0x491/0x810 [sunrpc]
>  > [427468.305337]  ? nfsd_svc+0x370/0x370 [nfsd]
>  > [427468.305389]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
>  > [427468.305437]  svc_process+0xad/0x100 [sunrpc]
>  > [427468.305505]  nfsd+0x99/0x140 [nfsd]
>  > [427468.305555]  kthread+0xda/0x100
>  > [427468.305562]  ? kthread_complete_and_exit+0x20/0x20
>  > [427468.305572]  ret_from_fork+0x22/0x30
> 
> and the other three are stuck in nfsd4_destroy_session
> 

All 4 processes are stuck waiting on a flush_workqueue call. It's
probably one of these:

    flush_workqueue(clp->cl_callback_wq);

So the question here is really -- why are the callback workqueue jobs
stuck? I do see this:

[427468.316839] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
[427468.316899] Call Trace:
[427468.316902]  <TASK>
[427468.316908]  __schedule+0x34d/0x9e0
[427468.316919]  schedule+0x5a/0xd0
[427468.316927]  schedule_timeout+0x94/0x150
[427468.316937]  ? __bpf_trace_tick_stop+0x10/0x10
[427468.316947]  rpc_shutdown_client+0xf2/0x150 [sunrpc]
[427468.317015]  ? cpuusage_read+0x10/0x10
[427468.317026]  nfsd4_process_cb_update+0x4c/0x270 [nfsd]
[427468.317097]  nfsd4_run_cb_work+0x9f/0x150 [nfsd]
[427468.317146]  process_one_work+0x1c7/0x380
[427468.317158]  worker_thread+0x4d/0x380
[427468.317170]  ? rescuer_thread+0x3a0/0x3a0
[427468.317177]  kthread+0xda/0x100
[427468.317185]  ? kthread_complete_and_exit+0x20/0x20
[427468.317195]  ret_from_fork+0x22/0x30
[427468.317213]  </TASK>

Maybe the RPC client is having trouble clearing clnt->cl_tasks ? 

>  > [427468.298315] task:nfsd            state:D stack:0     pid:1474 
> ppid:2      flags:0x00004000
>  > [427468.298322] Call Trace:
>  > [427468.298326]  <TASK>
>  > [427468.298332]  __schedule+0x34d/0x9e0
>  > [427468.298343]  schedule+0x5a/0xd0
>  > [427468.298350]  schedule_timeout+0x118/0x150
>  > [427468.298362]  wait_for_completion+0x86/0x160
>  > [427468.298375]  __flush_workqueue+0x152/0x420
>  > [427468.298392]  nfsd4_destroy_session+0x1b6/0x250 [nfsd]
>  > [427468.298456]  nfsd4_proc_compound+0x34c/0x660 [nfsd]
>  > [427468.298515]  nfsd_dispatch+0x1a1/0x2b0 [nfsd]
>  > [427468.298568]  svc_process_common+0x295/0x610 [sunrpc]
>  > [427468.298643]  ? svc_recv+0x491/0x810 [sunrpc]
>  > [427468.298711]  ? nfsd_svc+0x370/0x370 [nfsd]
>  > [427468.298776]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
>  > [427468.298825]  svc_process+0xad/0x100 [sunrpc]
>  > [427468.298896]  nfsd+0x99/0x140 [nfsd]
>  > [427468.298946]  kthread+0xda/0x100
>  > [427468.298954]  ? kthread_complete_and_exit+0x20/0x20
>  > [427468.298963]  ret_from_fork+0x22/0x30
> 
> There aren't a lot of 6.1-era changes in either of these, but there
> are some interesting behavior updates around session create replay
> from early this year. I wonder if the 6.1 server is mishandling an
> nfserr_jukebox situation in nfsd4_session_create.
> 
> Was the client actually attempting to mount or unmount?
> 
> Tom.
> 
> > 
> > Thank by advance,
> > Best regards
> > 
> > Le mercredi 23 octobre 2024 à 19:38 +0000, Chuck Lever III a écrit :
> > > 
> > > 
> > > > On Oct 23, 2024, at 3:27 PM, Benoît Gschwind
> > > > <benoit.gschwind@xxxxxxxxxxxxxxxxx> wrote:
> > > > 
> > > > Hello,
> > > > 
> > > > I have a nfs server using debian 11 (Linux hostname 6.1.0-25-amd64
> > > > #1
> > > > SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux)
> > > > 
> > > > In some heavy workload some nfsd goes in D state and seems to never
> > > > leave this state. I did a python script to monitor how long a
> > > > process
> > > > stay in particular state and I use it to monitor nfsd state. I get
> > > > the
> > > > following result :
> > > > 
> > > > [...]
> > > > 178056 I (idle) 0:25:24.475 [nfsd]
> > > > 178057 I (idle) 0:25:24.475 [nfsd]
> > > > 178058 I (idle) 0:25:24.475 [nfsd]
> > > > 178059 I (idle) 0:25:24.475 [nfsd]
> > > > 178060 I (idle) 0:25:24.475 [nfsd]
> > > > 178061 I (idle) 0:25:24.475 [nfsd]
> > > > 178062 I (idle) 0:24:15.638 [nfsd]
> > > > 178063 I (idle) 0:24:13.488 [nfsd]
> > > > 178064 I (idle) 0:24:13.488 [nfsd]
> > > > 178065 I (idle) 0:00:00.000 [nfsd]
> > > > 178066 I (idle) 0:00:00.000 [nfsd]
> > > > 178067 I (idle) 0:00:00.000 [nfsd]
> > > > 178068 I (idle) 0:00:00.000 [nfsd]
> > > > 178069 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178070 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178071 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178072 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178073 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178074 D (disk sleep) 1:29:25.809 [nfsd]
> > > > 178075 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178076 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178077 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178078 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178079 S (sleeping) 0:00:02.147 [nfsd]
> > > > 178080 D (disk sleep) 1:29:25.809 [nfsd]
> > > > 178081 D (disk sleep) 1:29:25.809 [nfsd]
> > > > 178082 D (disk sleep) 0:28:04.444 [nfsd]
> > > > 
> > > > All process not shown are in idle state. Columns are the following:
> > > > PID, state, state name, amoung of time the state did not changed
> > > > and
> > > > the process was not interrupted, and /proc/PID/status Name entry.
> > > > 
> > > > As you can read some nfsd process are in disk sleep state since
> > > > more
> > > > than 1 hour, but looking at the disk activity, there is almost no
> > > > I/O.
> > > > 
> > > > I tried to restart nfs-server but I get the following error from
> > > > the
> > > > kernel:
> > > > 
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:49 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 11:59:59 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:09 hostname kernel: rpc-srv/tcp: nfsd: got error -104
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:10 hostname kernel: rpc-srv/tcp: nfsd: got error -32
> > > > when sending 20 bytes - shutting down socket
> > > > oct. 23 12:00:10 hostname kernel: rpc-srv/tcp: nfsd: got error -32
> > > > when sending 20 bytes - shutting down socket
> > > > 
> > > > The only way to recover seems to reboot the kernel. I guess because
> > > > the
> > > > kernel force the reboot after a given timeout.
> > > > 
> > > > My setup involve in order :
> > > > - scsi driver
> > > > - mdraid on top of scsi (raid6)
> > > > - btrfs ontop of mdraid
> > > > - nfsd ontop of btrfs
> > > > 
> > > > 
> > > > The setup is not very fast as expected, but it seems that in some
> > > > situation nfsd never leave the disk sleep state. the exports
> > > > options
> > > > are: gss/krb5i(rw,sync,no_wdelay,no_subtree_check,fsid=XXXXX). The
> > > > situation is not commun but it's always happen at some point. For
> > > > instance in the case I report here, my server booted the 2024-10-01
> > > > and
> > > > was stuck about the 2024-10-23. I did reduced by a large amount the
> > > > frequency of issue by using no_wdelay (I did thought that I did
> > > > solved
> > > > the issue when I started to use this option).
> > > > 
> > > > My guess is hadware bug, scsi bug, btrfs bug or nfsd bug ?
> > > > 
> > > > Any clue on this topic or any advice is wellcome.
> > > 
> > > Generate stack traces for each process on the system
> > > using "sudo echo t > /proc/sysrq-trigger" and then
> > > examine the output in the system journal. Note the
> > > stack contents for the processes that look stuck.
> > > 
> > > --
> > > Chuck Lever
> > > 
> > > 
> > 
> 
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>