Hi,

On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On 1/14/25 3:23 AM, Rik Theys wrote:
> > Hi,
> >
> > On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >> On 1/12/25 7:42 AM, Rik Theys wrote:
> >>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 1/10/25 3:51 PM, Rik Theys wrote:
> >>>>> Are there any debugging commands we can run once the issue happens
> >>>>> that can help to determine the cause of this issue?
> >>>>
> >>>> Once the issue happens, the precipitating bug has already done its
> >>>> damage, so at that point it is too late.
> >>
> >> I've studied the code and bug reports a bit. I see one intriguing
> >> mention in comment #5:
> >>
> >> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
> >>
> >> /proc/130/stack:
> >> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> >> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> >> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> >> [<0>] process_one_work+0x1c7/0x380
> >> [<0>] worker_thread+0x4d/0x380
> >> [<0>] kthread+0xda/0x100
> >> [<0>] ret_from_fork+0x22/0x30
> >>
> >> This tells me that the active item on the callback_wq is waiting for the
> >> backchannel RPC client to shut down. This is probably the proximal cause
> >> of the callback workqueue stall.
> >>
> >> rpc_shutdown_client() is waiting for the client's cl_tasks to become
> >> empty. Typically this is a short wait. But here, there's one or more RPC
> >> requests that are not completing.
> >>
> >> Please issue these two commands on your server once it gets into the
> >> hung state:
> >>
> >> # rpcdebug -m rpc -c
> >> # echo t > /proc/sysrq-trigger
> >
> > There were no rpcdebug entries configured, so I don't think the first
> > command did much.
> >
> > You can find the output from the second command attached.
>
> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
> do see is a large number of OOM-killer notices. So, your server is out
> of memory. But I think this is due to a memory leak bug, probably this
> one:

I'm confused. Where do you see the OOM-killer notices in the log I
provided?

The first lines of the file are from after I increased
hung_task_warnings and waited a few minutes. This triggered the hung
task warning on the nfsd4 laundromat_main workqueue:

  Workqueue: nfsd4 laundromat_main [nfsd]
  Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:

Then I executed the commands you provided. Are the lines after the

  Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State

line not the output you're looking for?

Regards,
Rik

>
> https://bugzilla.kernel.org/show_bug.cgi?id=219535
>
> So that explains the source of the frequent deleg_reaper() calls on your
> system; it's the shrinker. (Note these calls are not the actual problem.
> The real bug is somewhere in the callback code, but frequent callbacks
> are making it easy to hit the callback bug.)
>
> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
> for more than 120 seconds." in the journal before issuing the rpcdebug
> and "echo t" commands.
>
> > Regards,
> > Rik
> >
> >> Then gift-wrap the server's system journal and send it to me. I need to
> >> see only the output from these two commands, so if you want to
> >> anonymize the journal and truncate it to just the day of the failure,
> >> I think that should be fine.
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever

--
Rik
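
P.S. To understand the stack above, I looked at rpc_shutdown_client()
in net/sunrpc/clnt.c. If I'm reading it correctly, it loops until the
client's task list drains. This is a paraphrased sketch from a recent
mainline tree, so the exact code in the kernel we run may differ:

  void rpc_shutdown_client(struct rpc_clnt *clnt)
  {
          /* Keep signalling the remaining tasks until they all exit. */
          while (!list_empty(&clnt->cl_tasks)) {
                  rpc_killall_tasks(clnt);
                  /* Re-check cl_tasks at least once per second. */
                  wait_event_timeout(destroy_wait,
                          list_empty(&clnt->cl_tasks), 1*HZ);
          }
          rpc_release_client(clnt);
  }

So if even one backchannel task never leaves cl_tasks, the callback_wq
worker spins here indefinitely, which would match the stack in the
Debian report.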
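
P.P.S. For the next attempt, I plan to capture everything along these
lines. The sysctl and journalctl invocations are my own guesses at the
cleanest way to do what you asked, so correct me if there's a better
approach:

  # sysctl -w kernel.hung_task_warnings=-1    # -1 = never suppress repeat warnings
  # rpcdebug -m rpc -c                        # clear sunrpc debug flags, as requested
  # echo t > /proc/sysrq-trigger              # dump all task states to the kernel log
  # journalctl -k --since today > hang.log    # kernel messages, truncated to today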