On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> On Sun, Jun 30, 2024 at 12:35:17PM -0400, Mike Snitzer wrote:
> > The need for this fix was exposed while developing a new NFS feature
> > called "localio", which bypasses the network if both the client and
> > server are on the same host; see:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-6.11
> >
> > Because NFS's nfsiod_workqueue enables WQ_MEM_RECLAIM, writeback will
> > call into NFS, and if localio is enabled the NFS client will call
> > directly into xfs_file_write_iter. This causes the following
> > backtrace when running xfstest generic/476 against NFS with localio:
>
> Oh, that's nasty.
>
> We now have to change every path in every filesystem that NFS can
> call that might defer work to a workqueue.
>
> IOWs, this makes WQ_MEM_RECLAIM pretty much mandatory for front end
> workqueues in the filesystem and block layers regardless of whether
> the filesystem or block layer path runs under memory reclaim context
> or not.

As you noticed later (yet didn't circle back to edit here), this is
just triggering when space is low.  But yeah, I submitted the patch
knowing it was likely not viable, I just wasn't clear on all the whys.
I appreciate your feedback, and yes, it did feel like a slippery slope
(understatement) that'd force other filesystems to follow.

> All WQ_MEM_RECLAIM does is create a rescuer thread at workqueue
> creation that is used as a "worker of last resort" when forking new
> worker threads fails due to ENOMEM. This prevents deadlocks when
> doing GFP_KERNEL allocations in workqueue context and potentially
> deadlocking because a GFP_KERNEL allocation is blocking waiting for
> this workqueue to allocate workers to make progress.

Right.
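For my own notes (and anyone following along): the flag only matters at
alloc_workqueue() time, and the problem is waiting on a rescuer-less
workqueue from memory reclaim context.  A stripped-down illustration,
with made-up names, of the dependency the splat below complains about:

#include <linux/errno.h>
#include <linux/workqueue.h>

/* Names are made up; this only mirrors the shape of the splat below. */
static struct workqueue_struct *reclaim_wq;	/* has a rescuer, like writeback */
static struct workqueue_struct *plain_wq;	/* no rescuer, like xfs-sync */
static struct work_struct reclaim_work;
static struct work_struct plain_work;		/* think xfs_flush_inodes_worker */

static void plain_fn(struct work_struct *w)
{
	/* may allocate memory and block on the filesystem */
}

static void reclaim_fn(struct work_struct *w)
{
	/*
	 * We are running on a WQ_MEM_RECLAIM workqueue (as wb_workfn does),
	 * yet we wait on a workqueue that has no rescuer.  If creating a
	 * worker for plain_wq needs memory that only this flush can free,
	 * we deadlock; that is the dependency check_flush_dependency()
	 * warns about in the splat below.
	 */
	queue_work(plain_wq, &plain_work);
	flush_work(&plain_work);
}

static int example_setup(void)
{
	/* WQ_MEM_RECLAIM allocates a rescuer thread at creation time. */
	reclaim_wq = alloc_workqueue("reclaim-side", WQ_MEM_RECLAIM, 0);
	/* No WQ_MEM_RECLAIM: no rescuer, and never meant to be flushed
	 * from memory reclaim context. */
	plain_wq = alloc_workqueue("plain-side", 0, 0);
	if (!reclaim_wq || !plain_wq)
		return -ENOMEM;

	INIT_WORK(&reclaim_work, reclaim_fn);
	INIT_WORK(&plain_work, plain_fn);
	queue_work(reclaim_wq, &reclaim_work);	/* should trip the WARN */
	return 0;
}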
> > workqueue: WQ_MEM_RECLAIM writeback:wb_workfn is flushing !WQ_MEM_RECLAIM xfs-sync/vdc:xfs_flush_inodes_worker
> > WARNING: CPU: 6 PID: 8525 at kernel/workqueue.c:3706 check_flush_dependency+0x2a4/0x328
> > Modules linked in:
> > CPU: 6 PID: 8525 Comm: kworker/u71:5 Not tainted 6.10.0-rc3-ktest-00032-g2b0a133403ab #18502
> > Hardware name: linux,dummy-virt (DT)
> > Workqueue: writeback wb_workfn (flush-0:33)
> > pstate: 400010c5 (nZcv daIF -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
> > pc : check_flush_dependency+0x2a4/0x328
> > lr : check_flush_dependency+0x2a4/0x328
> > sp : ffff0000c5f06bb0
> > x29: ffff0000c5f06bb0 x28: ffff0000c998a908 x27: 1fffe00019331521
> > x26: ffff0000d0620900 x25: ffff0000c5f06ca0 x24: ffff8000828848c0
> > x23: 1fffe00018be0d8e x22: ffff0000c1210000 x21: ffff0000c75fde00
> > x20: ffff800080bfd258 x19: ffff0000cad63400 x18: ffff0000cd3a4810
> > x17: 0000000000000000 x16: 0000000000000000 x15: ffff800080508d98
> > x14: 0000000000000000 x13: 204d49414c434552 x12: 1fffe0001b6eeab2
> > x11: ffff60001b6eeab2 x10: dfff800000000000 x9 : ffff60001b6eeab3
> > x8 : 0000000000000001 x7 : 00009fffe491154e x6 : ffff0000db775593
> > x5 : ffff0000db775590 x4 : ffff0000db775590 x3 : 0000000000000000
> > x2 : 0000000000000027 x1 : ffff600018be0d62 x0 : dfff800000000000
> > Call trace:
> >  check_flush_dependency+0x2a4/0x328
> >  __flush_work+0x184/0x5c8
> >  flush_work+0x18/0x28
> >  xfs_flush_inodes+0x68/0x88
> >  xfs_file_buffered_write+0x128/0x6f0
> >  xfs_file_write_iter+0x358/0x448
> >  nfs_local_doio+0x854/0x1568
> >  nfs_initiate_pgio+0x214/0x418
> >  nfs_generic_pg_pgios+0x304/0x480
> >  nfs_pageio_doio+0xe8/0x240
> >  nfs_pageio_complete+0x160/0x480
> >  nfs_writepages+0x300/0x4f0
> >  do_writepages+0x12c/0x4a0
> >  __writeback_single_inode+0xd4/0xa68
> >  writeback_sb_inodes+0x470/0xcb0
> >  __writeback_inodes_wb+0xb0/0x1d0
> >  wb_writeback+0x594/0x808
> >  wb_workfn+0x5e8/0x9e0
> >  process_scheduled_works+0x53c/0xd90
>
> Ah, this is just the standard backing device flusher thread that is
> running. This is the back end of filesystem writeback, not the front
> end. It was never intended to be able to directly do loop back IO
> submission to the front end filesystem IO paths like this - they are
> very different contexts and have very different constraints.
>
> This particular situation occurs when XFS is near ENOSPC. There's a
> very high probability it is going to fail these writes, and so it's
> doing slow path work that involves blocking, and front end filesystem
> processing is allowed to block on just about anything in the
> filesystem as long as it can guarantee it won't deadlock.
>
> Fundamentally, doing IO submission in WQ_MEM_RECLAIM context changes
> the submission context for -all- filesystems, not just XFS.

Yes, I see that.

> If we have to make this change to XFS, then -every-
> workqueue in XFS (not just this one) must be converted to
> WQ_MEM_RECLAIM, and then many workqueues in all the other
> filesystems will need to have the same changes made, too.

AFAICT they are all WQ_MEM_RECLAIM aside from m_sync_workqueue, but
that's beside the point.

> That doesn't smell right to me.
>
> ----
>
> So let's look at how back end filesystem IO currently submits new
> front end filesystem IO: the loop block device does this, and it
> uses workqueues to defer submitted IO so that the lower IO
> submission context can be directly controlled and made to match the
> front end filesystem IO submission path behaviours.
> The loop device does not use rescuer threads - that's not needed
> when you have a queue based submission and just use the workqueues
> to run the queues until they are empty. So the loop device uses
> a standard unbound workqueue for its IO submission path, and
> then when the work is running it sets the task flags to say "this is
> a nested IO worker thread" before it starts processing the
> submission queue and submitting new front end filesystem IO:
>
> static void loop_process_work(struct loop_worker *worker,
>			struct list_head *cmd_list, struct loop_device *lo)
> {
>	int orig_flags = current->flags;
>	struct loop_cmd *cmd;
>
>	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
>
> PF_LOCAL_THROTTLE prevents deadlocks in balance_dirty_pages() by
> lifting the dirty ratio for this thread a little, hence giving it
> priority over the upper filesystem. i.e. the upper filesystem will
> throttle incoming writes first, then the back end IO submission
> thread can still submit new front end IOs to the lower filesystem
> and they won't block in balance_dirty_pages() because the lower
> filesystem has a higher limit. Hence the lower filesystem can always
> drain the dirty pages on the upper filesystem, and the system won't
> deadlock in balance_dirty_pages().

Perfect, thanks for the guidance.

> Using WQ_MEM_RECLAIM context for IO submission does not address this
> deadlock.
>
> The PF_MEMALLOC_NOIO flag prevents the lower filesystem IO from
> causing memory reclaim to re-enter filesystems or IO devices and so
> prevents deadlocks from occurring where IO that cleans pages is
> waiting on IO to complete.
>
> Using WQ_MEM_RECLAIM context for IO submission does not address this
> deadlock either.
>
> IOWs, doing front end IO submission like this from the BDI flusher
> thread is guaranteed to deadlock sooner or later, regardless of
> whether WQ_MEM_RECLAIM is set or not on workqueues that are flushed
> during IO submission. The WQ_MEM_RECLAIM warning is effectively your
> canary in the coal mine. And the canary just carked it.

Yes, I knew it as such, but I wanted to pin down why it died...
thanks for your help!

FYI, the commit below follows the long-standing approach Trond used to
deal with this WQ_MEM_RECLAIM situation; I was just hoping to avoid
having to introduce a dedicated workqueue for localio's needs
(NeilBrown really disliked that its header framed it as avoiding
blowing the stack, but we get that as a side-effect of needing it to
avoid WQ_MEM_RECLAIM):
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=0cd25f7610df291827ad95023e03fdd4f93bbea7

(I revised the patch to add PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO and
adjusted the header to reflect our discussion in this thread.)

> IMO, the only sane way to ensure this sort of nested "back-end page
> cleaning submits front-end filesystem IO" mechanism works is to
> do something similar to the loop device. You most definitely don't
> want to be doing buffered IO (double caching is almost always bad)
> and you want to be doing async direct IO so that the submission
> thread is not waiting on completion before the next IO is
> submitted.
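To make the parenthetical above concrete, the revised localio
submission work function now does something along these lines.  This
is a from-memory sketch; the function name and surrounding plumbing
are illustrative, not the exact code in the branch:

#include <linux/sched.h>
#include <linux/workqueue.h>

/* Illustrative name/shape only; the real plumbing lives in the branch above. */
static void nfs_local_submit_work(struct work_struct *work)
{
	unsigned int orig_flags = current->flags;

	/*
	 * Same trick as loop_process_work():
	 *  - PF_LOCAL_THROTTLE: this thread gets a slightly higher dirty
	 *    threshold, so it can always drain the upper (NFS) filesystem's
	 *    dirty pages without blocking in balance_dirty_pages() behind
	 *    the lower filesystem's limit.
	 *  - PF_MEMALLOC_NOIO: allocations made while submitting the lower
	 *    filesystem IO cannot recurse into reclaim that itself issues IO.
	 */
	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

	/* ... submit the IO to the lower filesystem, e.g. via ->write_iter() ... */

	current->flags = orig_flags;
}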
Yes, follow-on work for me is to revive the directio path for localio
that ultimately wasn't pursued (or properly wired up) because it
creates DIO alignment requirements on NFS client IO:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab

But the underlying filesystems (like XFS) have the appropriate checks;
we just need to fail gracefully and disable NFS localio if the IO is
misaligned (roughly along the lines of the untested sketch below).

Mike
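Untested sketch mentioned above: the helper name is made up, and
querying STATX_DIOALIGN via vfs_getattr() is just one possible source
for the alignment constraints; the real code might instead reuse
checks the underlying filesystem already exports.

#include <linux/fs.h>
#include <linux/stat.h>
#include <linux/fcntl.h>

/*
 * Hypothetical helper: decide whether a localio O_DIRECT submission to
 * the underlying filesystem is sufficiently aligned, or whether we must
 * fall back and disable localio for this IO.
 */
static bool nfs_local_dio_aligned(struct file *file, loff_t pos, size_t len,
				  unsigned long buf_addr)
{
	struct kstat stat;

	if (vfs_getattr(&file->f_path, &stat, STATX_DIOALIGN,
			AT_STATX_SYNC_AS_STAT))
		return false;
	if (!(stat.result_mask & STATX_DIOALIGN) ||
	    !stat.dio_mem_align || !stat.dio_offset_align)
		return false;	/* no DIO support or no alignment info */

	/* file offset and length must honour stx_dio_offset_align ... */
	if ((pos | len) & (stat.dio_offset_align - 1))
		return false;
	/* ... and the memory buffer must honour stx_dio_mem_align */
	return !(buf_addr & (stat.dio_mem_align - 1));
}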