On Sat, 28 Sep 2024, Harald Dunkel wrote:
> Hi folks,
>
> On 2024-09-21 09:58:55, Harald Dunkel wrote:
> > NeilBrown wrote:
> >>
> >> We can guess though. It isn't waiting for a lock - that would show in
> >> the above list - so it might be waiting for a wakeup, or might be
> >> spinning.
> >> The only wake-up I can imagine is in one of the memory-allocation calls,
> >> but if the system were running out of memory we would probably see
> >> messages about that.
> >>
> >
> > I have seen something like this. I am running NFS inside a container,
> > using legacy cgroups. When it got stuck, it claimed I could not log in
> > to the container due to being out of memory. When it happens again, I
> > can send you the exact error message. The next hung nfsd is overdue,
> > anyway.
> >
>
> My NFS server got stuck again last night. Unfortunately the service was
> recovered by a colleague, so I had no chance to check the memory. Attached
> you can find the log files of both the nfs container and the LXC server,
> with /proc/sys/kernel/hung_task_all_cpu_backtrace set to 1.
>
> I dropped the kernel mailing list from this reply, due to the large
> attachments. Hopefully this was OK?
>
> Hope this helps. Please mail if I can help.
> Harri

Thanks for the logs. They point to flush_workqueue() being a problem,
presumably called from nfsd4_probe_callback_sync(), though I'm not 100% sure
of that. Maybe there is some deadlock in the callback code. I'm not very
familiar with that code and nothing immediately jumps out.

I had thought that hung_task_all_cpu_backtrace would show a backtrace of
*all* tasks - I missed the "cpu" in there. If it happens again and you can

  echo t > /proc/sysrq-trigger

to get stack traces of everything, that might help. Maybe it won't be
necessary if I or someone else can spot a deadlock with flush_workqueue().

Thanks,
NeilBrown
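
[A practical note for anyone following the suggestion above: the 't' sysrq
command only works if sysrq is enabled, which on many distributions it is not
by default. A minimal sketch of the full sequence, using the standard procfs
interfaces (requires root; writing 1 enables all sysrq commands, which you may
want to revert afterwards):]

```shell
# Enable all sysrq commands. Many distros default /proc/sys/kernel/sysrq
# to a restricted bitmask that does not include the task dump.
echo 1 > /proc/sys/kernel/sysrq

# 't' dumps the stack of every task to the kernel log.
echo t > /proc/sysrq-trigger

# The backtraces land in the kernel ring buffer / syslog.
dmesg | less          # or: journalctl -k
```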