Hi Neil,

On 2024-09-29 00:23:18, NeilBrown wrote:
> Thanks for the logs. They point to flush_workqueue() being a problem, presumably from nfsd4_probe_callback_sync(), though I'm not 100% sure of that. Maybe some deadlock in the callback code. I'm not very familiar with that code and nothing immediately jumps out.
>
> I had thought that hung_task_all_cpu_backtrace would show a backtrace of *all* tasks - I missed the "cpu" in there.
>
> If it happens again and if you can
>   echo t > /proc/sysrq-trigger
> to get stack traces of everything, that might help. Maybe it won't be necessary if I or someone else can spot a deadlock with flush_workqueue().

I just learned that kernel.hung_task_panic = 1 should have been set, too. Sorry for that. Currently I have

  kernel.hung_task_panic = 1
  kernel.hung_task_all_cpu_backtrace = 1

Please confirm.

I have set a watchdog to run the sysrq trigger on the NFS server if df on an NFS client doesn't respond.

Regards
Harri
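
A watchdog of the kind described above could be sketched roughly as follows. This is an illustration only, not the script actually in use: the function names, the 30 second timeout, and the use of ssh as root to reach the NFS server are all assumptions. It runs df against a mount point on the client and, if df does not return in time, writes "t" to /proc/sysrq-trigger on the server.

```python
import subprocess


def df_responds(mount_point, timeout_s=30.0, df_cmd=("df",)):
    """Return True if `df <mount_point>` exits within timeout_s seconds."""
    proc = subprocess.Popen(
        [*df_cmd, mount_point],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Best effort only: a task stuck on a hung NFS mount may not die
        # promptly, so do not block here waiting for it to exit.
        proc.kill()
        return False


def trigger_sysrq_t(server):
    """Ask the server to dump all task stacks to its kernel log.

    Assumption: root can ssh from the client to the server without a
    password prompt. The redirect is evaluated by the remote shell.
    """
    subprocess.run(
        ["ssh", f"root@{server}", "echo t > /proc/sysrq-trigger"],
        check=True,
        timeout=60,
    )


def check_once(mount_point, server, timeout_s=30.0,
               df_cmd=("df",), trigger=trigger_sysrq_t):
    """Run one check; return True if the sysrq trigger was fired."""
    if df_responds(mount_point, timeout_s, df_cmd):
        return False
    trigger(server)
    return True
```

Called periodically, for example from cron, as check_once("/mnt/nfs", "nfs-server"), where both the mount point and the host name are placeholders. The df_cmd and trigger parameters exist only so the logic can be exercised without a real NFS mount.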