On Wed, Jul 10, 2024 at 09:12:58AM +1000, NeilBrown wrote:
> On Thu, 04 Jul 2024, Christoph Hellwig wrote:
> > On Wed, Jul 03, 2024 at 09:29:00PM +1000, NeilBrown wrote:
> > > I know nothing of this stance. Do you have a reference?
> >
> > No particular one.
> >
> > > I have put a modest amount of work into ensuring NFS to a server on the same machine works and last I checked it did - though I'm more confident of NFSv3 than NFSv4 because of the state manager thread.
> >
> > How do you propagate the NOFS flag (and NOIO for a loop device) to the server and the workqueues run by the server and the file system called by it? How do you ensure WQ_MEM_RECLAIM gets propagated to all workqueues that could be called by the file system on the server (the problem kicking off this discussion)?
>
> Do we need to propagate these?
>
> NOFS is for deadlock avoidance. A filesystem "backend" (Dave's term - I think for the parts of the fs that handle write-back) might allocate memory, that might block waiting for memory reclaim, memory reclaim might re-enter the filesystem backend and might block on a lock (or similar) held while allocating memory. NOFS breaks that deadlock.
>
> The important thing here isn't the NOFS flag, it is breaking any possible deadlock.

NOFS doesn't "break" any deadlocks. It simply prevents recursion from one filesystem context to another. We don't have to use NOFS if recursion is safe and won't deadlock. That is, it may be safe for a filesystem to use GFP_KERNEL allocations in its writeback path.

If the filesystem doesn't implement ->writepage (like most of the major filesystems these days) there is no way for memory reclaim to recurse back into the fs writeback path. Hence GFP_NOFS is not needed in writeback context to prevent reclaim recursion back into the filesystem writeback path....

And the superblock shrinkers can't deadlock - they are non-blocking and only act on unreferenced inodes. Hence any code that has a locked inode is either evicting an unreferenced inode or holds a reference to the inode. If we are doing an allocation with either of those sorts of inodes locked, there is no way that memory reclaim recursion can trip over the locked inode and deadlock.

So between the removal of ->writepage, non-blocking shrinkers, and scoped NOIO context for loop devices, I'm not sure there are any generic reclaim recursion paths that can actually deadlock. i.e. GFP_NOFS really only needs to be used if the filesystem itself cannot safely recurse back into itself.
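For the cases where a filesystem does need that protection, the scoped memalloc_nofs_save()/memalloc_nofs_restore() API is the way to express it - the whole critical section becomes NOFS rather than tagging individual allocations. A minimal sketch, illustrative only (the function and its name are made up, the scoped API is real):

        #include <linux/sched/mm.h>
        #include <linux/slab.h>

        /*
         * Any allocation between save and restore - including GFP_KERNEL
         * allocations done by library code called from here - behaves as
         * GFP_NOFS, so reclaim cannot recurse back into the filesystem
         * while we hold fs locks.
         */
        static void *alloc_under_fs_lock(size_t size)
        {
                unsigned int nofs_flags;
                void *p;

                nofs_flags = memalloc_nofs_save();
                p = kmalloc(size, GFP_KERNEL);
                memalloc_nofs_restore(nofs_flags);
                return p;
        }

memalloc_noio_save()/memalloc_noio_restore() is the equivalent NOIO scope - setting PF_MEMALLOC_NOIO directly on the task, as the loop worker code quoted below does, has the same effect.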
> Layered filesystems introduce a new complexity.

Nothing new about layered filesystems - we've been doing this for decades...

> The backend for one filesystem can call into the front end of another filesystem. That front-end is not required to use NOFS and even if we impose PF_MEMALLOC_NOFS, the front-end might wait for some work-queue action which doesn't inherit the NOFS flag.
>
> But this doesn't necessarily matter. Calling into the filesystem is not the problem - blocking waiting for a reply is the problem. It is blocking that creates deadlocks. So if the backend of one filesystem queues to a separate thread the work for the front end of the other filesystem and doesn't wait for the work to complete, then a deadlock cannot be introduced.
>
> /dev/loop uses the loop%d workqueue for this. loop-back NFS hands the front-end work over to nfsd. The proposed localio implementation uses a nfslocaliod workqueue for exactly the same task. These remove the possibility of deadlock and mean that there is no need to pass NOFS through to the front-end of the backing filesystem.

I think this logic is fundamentally flawed. Pushing IO submission to a separate thread context which runs it in GFP_KERNEL context does not help if the deadlock occurs during IO submission.

With loop devices, there's a "global" lock in the lower filesystem backing the loop device - the image file inode lock. The IO issued by the loop device will -always- hit the same inode and the same inode locks. Hence if we do memory allocation with an inode lock held exclusive in the lower filesystem (e.g. a page cache folio for a buffered write), we cannot allow memory reclaim during any allocation with the image file inode locked to recurse into the upper filesystem. If the upper filesystem then performs an operation that requires IO to be submitted and completed to make progress, we have a deadlock condition due to recursion from the lower to the upper filesystem, regardless of the fact that the lower IO submission is run from a different task.

Hence the loop device sets up the backing file mapping as:

        lo->lo_backing_file = file;
        lo->old_gfp_mask = mapping_gfp_mask(mapping);
        mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));

i.e. GFP_NOIO context. It also sets up the worker task context as:

        current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

i.e. GFP_NOIO context again. IOWs, all allocation in the IO submission path is explicitly GFP_NOIO to prevent any sort of reclaim recursion into filesystems or the block layer. That's the only sane thing to do, because multi-filesystem deadlocks are an utter PITA to triage and solve...

Keep in mind that PF_LOCAL_THROTTLE also prevents IO submission deadlocks in the lower filesystem. If the lower filesystem IO submission dirties pages (i.e. buffered writes) it can get throttled on the dirty page threshold. If it gets throttled like this trying to clean dirty pages from the upper filesystem we have a deadlock. The localio submission task will need to prevent that deadlock, too.

IOWs, just moving IO submission to another thread does not avoid the possibility of lower-to-upper filesystem recursion or lower filesystem dirty page throttling deadlocks.

> Note that there is a separate question concerning pageout to a swap file. pageout needs more than just deadlock avoidance. It needs guaranteed progress in low memory conditions. It needs PF_MEMALLOC (or mempools) and that cannot be finessed using work queues. I don't think that Linux is able to support pageout through layered filesystems.

I don't think we ever want to go there.

-Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx