On Sat, 28 Sep 2024, Harald Dunkel wrote:
> Hi folks,
>
> On 2024-09-21 09:58:55, Harald Dunkel wrote:
> > NeilBrown wrote:
> >>
> >> We can guess though. It isn't waiting for a lock - that would show in
> >> the above list - so it might be waiting for a wakeup, or might be
> >> spinning.
> >> The only wake-up I can imagine is in one of the memory-allocation calls,
> >> but if the system were running out of memory we would probably see
> >> messages about that.
> >>
> >
> > I have seen something like this. I am running NFS inside a container,
> > using legacy cgroups. When it got stuck, it claimed I could not log in
> > to the container due to being out of memory. When it happens again, I
> > can send you the exact error message. The next hung nfsd is overdue,
> > anyway.
> >
>
> My NFS server got stuck again last night. Unfortunately the service was
> recovered by a colleague, so I had no chance to check the memory. Attached
> you can find the log files of both the nfs container and the LXC server,
> with /proc/sys/kernel/hung_task_all_cpu_backtrace set to 1.
>
> I dropped the kernel mailing list from this reply, due to the large
> attachments. Hopefully this was OK?
>
> Hope this helps. Please mail if I can help.
> Harri

Thanks for the logs. They point to flush_workqueue() being a problem,
presumably called from nfsd4_probe_callback_sync(), though I'm not 100% sure
of that. Maybe there is some deadlock in the callback code. I'm not very
familiar with that code and nothing immediately jumps out.

I had thought that hung_task_all_cpu_backtrace would show a backtrace of
*all* tasks - I missed the "cpu" in there. If it happens again and you can

  echo t > /proc/sysrq-trigger

to get stack traces of everything, that might help. Maybe it won't be
necessary if I or someone else can spot a deadlock with flush_workqueue().

Thanks,
NeilBrown
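
[A practical note for anyone following the suggestion above: the 't' sysrq
command only works if sysrq is enabled, which on many distributions it is not
by default. A minimal sketch of the full sequence, using the standard procfs
interfaces (requires root; writing 1 enables all sysrq commands, which you may
want to revert afterwards):]

```shell
# Enable all sysrq commands. Many distros default /proc/sys/kernel/sysrq
# to a restricted bitmask that does not include the task dump.
echo 1 > /proc/sys/kernel/sysrq

# 't' dumps the stack of every task to the kernel log.
echo t > /proc/sysrq-trigger

# The backtraces land in the kernel ring buffer / syslog.
dmesg | less          # or: journalctl -k
```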