On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> Hello Trond,
>
> On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > Hello everyone,
> > >
> > > Hopefully I'm sending this to the right place…
> > > We recently started to see the following stacktrace quite often on
> > > our VMs that are using NFS extensively (I think after upgrading to
> > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > 5.18.15):
> > >
> > > INFO: task kworker/u36:10:377691 blocked for more than 122 seconds.
> > > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid: 2 flags:0x00004000
> > > Workqueue: writeback wb_workfn (flush-0:308)
> > > Call Trace:
> > >  <TASK>
> > >  __schedule+0x38c/0x7d0
> > >  schedule+0x41/0xb0
> > >  io_schedule+0x12/0x40
> > >  __folio_lock+0x110/0x260
> > >  ? filemap_alloc_folio+0x90/0x90
> > >  write_cache_pages+0x1e3/0x4d0
> > >  ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > >  nfs_writepages+0xe1/0x200 [nfs]
> > >  do_writepages+0xd2/0x1b0
> > >  ? check_preempt_curr+0x47/0x70
> > >  ? ttwu_do_wakeup+0x17/0x180
> > >  __writeback_single_inode+0x41/0x360
> > >  writeback_sb_inodes+0x1f0/0x460
> > >  __writeback_inodes_wb+0x5f/0xd0
> > >  wb_writeback+0x235/0x2d0
> > >  wb_workfn+0x348/0x4a0
> > >  ? put_prev_task_fair+0x1b/0x30
> > >  ? pick_next_task+0x84/0x940
> > >  ? __update_idle_core+0x1b/0xb0
> > >  process_one_work+0x1c5/0x390
> > >  worker_thread+0x30/0x360
> > >  ? process_one_work+0x390/0x390
> > >  kthread+0xd7/0x100
> > >  ? kthread_complete_and_exit+0x20/0x20
> > >  ret_from_fork+0x1f/0x30
> > >  </TASK>
> > >
> > > I see that something very similar was fixed in btrfs
> > > (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > but I could not find anything similar for NFS.
> > >
> > > Do you happen to know if this is already fixed? If so, would you
> > > mind sharing some commits? If not, could you help getting this
> > > addressed?
> >
> > The stack trace you show above isn't particularly helpful for
> > diagnosing what the problem is.
> >
> > All it is saying is that 'thread A' is waiting to take a page lock
> > that is being held by a different 'thread B'. Without information on
> > what 'thread B' is doing, and why it isn't releasing the lock, there
> > is nothing we can conclude.
>
> Do you have some hint how to debug this issue further (when it happens
> again)? Would `virsh dump` to get a memory dump and then some kind of
> "bt all" via crash help to get more information?
> Or something else?
>
> Thanks in advance!
> --
> Igor Raits

Please try running the following two lines of 'bash' script as root:

(for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:"; cat ${tt}; echo; done) >/tmp/nfs_threads.txt

cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt

and then send us the output from the two files /tmp/nfs_threads.txt and
/tmp/rpc_tasks.txt.
The file nfs_threads.txt gives us a full set of stack traces from all
processes that are currently in the NFS client code. So it should contain
both the stack trace from your 'thread A' above, and the traces from all
candidates for the 'thread B' process that is causing the blockage.

The file rpc_tasks.txt gives us the status of any RPC calls that might be
outstanding, and might help diagnose any issues with the TCP connection.

That should therefore give us a better starting point for root causing
the problem.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
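
For reference, the two collection commands above can also be wrapped in a
small script and sampled a few times while the hang is in progress, so the
output has a better chance of catching the 'thread B' page-lock holder in
the act. This is only a sketch built around the commands quoted above; the
output directory (/tmp/nfs-debug), the sample count and the 30-second
interval are illustrative assumptions, not something requested in the
thread.

#!/bin/bash
# Sketch only: sample NFS client stack traces and outstanding RPC tasks a
# few times while a hang is in progress. The output directory, sample
# count and interval are illustrative assumptions.
outdir=/tmp/nfs-debug
mkdir -p "$outdir"

for i in 1 2 3; do
    ts=$(date +%Y%m%d-%H%M%S)

    # Stack traces of every task currently inside the NFS client code
    # (the same loop as above, written out long-hand).
    for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do
        echo "${tt}:"
        cat "${tt}"
        echo
    done > "$outdir/nfs_threads.$ts.txt"

    # Status of any outstanding RPC calls known to the sunrpc client.
    cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks \
        > "$outdir/rpc_tasks.$ts.txt" 2>/dev/null

    sleep 30
done

Run it as root while the hung-task messages are being logged, then attach
the timestamped files it produces.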