Hello Trond,

On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > Hello everyone,
> >
> > Hopefully I'm sending this to the right place…
> > We recently started to see the following stack trace quite often on
> > our VMs that use NFS extensively (I think after upgrading to
> > 5.18.11+, but I'm not sure exactly when. It definitely happens on
> > 5.18.15):
> >
> > INFO: task kworker/u36:10:377691 blocked for more than 122 seconds.
> >       Tainted: G            E     5.18.15-1.gdc.el8.x86_64 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:kworker/u36:10 state:D stack:    0 pid:377691 ppid:     2 flags:0x00004000
> > Workqueue: writeback wb_workfn (flush-0:308)
> > Call Trace:
> >  <TASK>
> >  __schedule+0x38c/0x7d0
> >  schedule+0x41/0xb0
> >  io_schedule+0x12/0x40
> >  __folio_lock+0x110/0x260
> >  ? filemap_alloc_folio+0x90/0x90
> >  write_cache_pages+0x1e3/0x4d0
> >  ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> >  nfs_writepages+0xe1/0x200 [nfs]
> >  do_writepages+0xd2/0x1b0
> >  ? check_preempt_curr+0x47/0x70
> >  ? ttwu_do_wakeup+0x17/0x180
> >  __writeback_single_inode+0x41/0x360
> >  writeback_sb_inodes+0x1f0/0x460
> >  __writeback_inodes_wb+0x5f/0xd0
> >  wb_writeback+0x235/0x2d0
> >  wb_workfn+0x348/0x4a0
> >  ? put_prev_task_fair+0x1b/0x30
> >  ? pick_next_task+0x84/0x940
> >  ? __update_idle_core+0x1b/0xb0
> >  process_one_work+0x1c5/0x390
> >  worker_thread+0x30/0x360
> >  ? process_one_work+0x390/0x390
> >  kthread+0xd7/0x100
> >  ? kthread_complete_and_exit+0x20/0x20
> >  ret_from_fork+0x1f/0x30
> >  </TASK>
> >
> > I see that something very similar was fixed in btrfs
> > (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > but I could not find anything similar for NFS.
> >
> > Do you happen to know if this is already fixed? If so, would you mind
> > sharing some commits? If not, could you help get this addressed?
> >
>
> The stack trace you show above isn't particularly helpful for
> diagnosing what the problem is.
>
> All it is saying is that 'thread A' is waiting to take a page lock that
> is being held by a different 'thread B'. Without information on what
> 'thread B' is doing, and why it isn't releasing the lock, there is
> nothing we can conclude.

Do you have any hints on how to debug this further when it happens
again? Would taking a memory dump with `virsh dump` and then running
some kind of "bt all" via crash help to get more information? Or
something else?

Thanks in advance!
--
Igor Raits
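
P.S. For reference, this is the rough workflow I had in mind (the guest
domain name, the vmcore path and the debuginfo path below are just
examples, and it assumes a vmlinux with debug info matching the guest
kernel is available to crash):

  # On the host: take a memory-only dump of the guest in ELF format
  virsh dump --memory-only --format elf <guest-domain> /var/tmp/guest.vmcore

  # Open the dump with crash, pointing it at the vmlinux that matches the
  # guest kernel (5.18.15-1.gdc.el8.x86_64 in this case; example path)
  crash /usr/lib/debug/lib/modules/5.18.15-1.gdc.el8.x86_64/vmlinux /var/tmp/guest.vmcore

  # Inside crash: list tasks in uninterruptible sleep and dump their stacks
  crash> ps | grep UN
  crash> foreach UN bt

If there is a better way to capture what 'thread B' is doing while the
hang is in progress, I'm happy to try that instead.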