Removal of KM_NOFS

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Fri, 29 Sep 2023 23:18:19 +0100

I had a long plane ride yesterday, and I started on "Removing GFP_NOFS".
TLDR: I don't know enough about XFS to do this first step.  There are
various options and none of them are "obviously" the right thing to do.

The overall intent is to get rid of the __GFP_FS flag entirely; make
GFP_NOFS the same as GFP_KERNEL (and a later patch could rename all
the uses of GFP_NOFS to GFP_KERNEL).  That is, the only way to prevent
the memory allocator from entering fs reclaim would be by calling
memalloc_nofs_save().
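
To make the intended end state concrete, here is a rough userspace model of
the scoped behaviour. It is a sketch under stated assumptions, not kernel
code: every MODEL_* and model_* name is invented for illustration, and
task_flags is a plain global standing in for the per-task flags.

```c
/* Userspace model only; names echo the kernel's, but this is not kernel code. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_FS            0x1u  /* allocation may recurse into fs reclaim */
#define MODEL_GFP_KERNEL        (MODEL_GFP_FS | 0x2u)
#define MODEL_GFP_NOFS          (MODEL_GFP_KERNEL & ~MODEL_GFP_FS)
#define MODEL_PF_MEMALLOC_NOFS  0x1u  /* task is inside a nofs scope */

/* Stand-in for the per-task flags word (single-threaded model). */
static unsigned int task_flags;

/* Enter a nofs scope; the return value lets scopes nest. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & MODEL_PF_MEMALLOC_NOFS;

	task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a nofs scope, putting back whatever state the matching save saw. */
static void model_nofs_restore(unsigned int old)
{
	assert((old & ~MODEL_PF_MEMALLOC_NOFS) == 0);
	task_flags = (task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* The allocator strips the FS bit from the caller's mask inside a scope. */
static bool model_may_enter_fs_reclaim(unsigned int gfp)
{
	if (task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return (gfp & MODEL_GFP_FS) != 0;
}

/* Returns 1 if an inner restore leaves the outer scope in force. */
static int model_nested_scopes_ok(void)
{
	unsigned int outer = model_nofs_save();
	unsigned int inner = model_nofs_save();
	bool blocked_after_inner;

	model_nofs_restore(inner);
	blocked_after_inner = !model_may_enter_fs_reclaim(MODEL_GFP_KERNEL);
	model_nofs_restore(outer);

	return blocked_after_inner && model_may_enter_fs_reclaim(MODEL_GFP_KERNEL);
}
```

In this model a MODEL_GFP_KERNEL allocation inside a scope behaves like
MODEL_GFP_NOFS does today, which is why the per-call flag becomes redundant
once every unsafe section is covered by a scope.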

XFS already calls memalloc_nofs_save() when starting a transaction.
This is almost certainly the right thing to do; many things which
could be freed through fs reclaim would need to start a transaction,
and we don't want to do a nested transaction.  But it turns out there
are several places that can't enter fs reclaim for other reasons.

Having boldly just removed __GFP_FS, I encountered problems (ie
lockdep got chatty) in XFS and now I don't think I know enough to
take on the prerequisite project of removing KM_NOFS.  While this is
obviously _possible_ (simply add calls to memalloc_nofs_save() and
memalloc_nofs_restore() around calls currently marked as KM_NOFS),
that's not really how the scoped API is supposed to be used.  Instead,
one is supposed to improve the understandability of the code by marking
the sections where, say, a lock is taken as now being unsafe to enter
fs reclaim because that lock is held.
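
As an illustration of tying the scope to the lock rather than to individual
allocations, the following userspace sketch wraps a toy lock so that taking
it enters a nofs scope and dropping it leaves the scope. All names are
hypothetical, the lock is a bool rather than a real mutex, and keeping the
saved state in the lock assumes lock and unlock happen in the same task.

```c
/* Userspace model only; not kernel code and not an XFS interface. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_PF_MEMALLOC_NOFS  0x1u

/* Stand-in for the per-task flags word (single-threaded model). */
static unsigned int task_flags;

static unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & MODEL_PF_MEMALLOC_NOFS;

	task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

static void model_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* Hypothetical lock that reclaim also takes, so holding it forbids fs reclaim. */
struct model_reclaim_lock {
	bool held;
	unsigned int saved_nofs;  /* state to put back on unlock */
};

/* Taking the lock is what makes fs reclaim unsafe, so the scope starts here. */
static void model_lock(struct model_reclaim_lock *l)
{
	assert(!l->held);
	l->held = true;
	l->saved_nofs = model_nofs_save();
}

/* Dropping the lock ends the scope it opened. */
static void model_unlock(struct model_reclaim_lock *l)
{
	assert(l->held);
	model_nofs_restore(l->saved_nofs);
	l->held = false;
}

static bool model_in_nofs_scope(void)
{
	return (task_flags & MODEL_PF_MEMALLOC_NOFS) != 0;
}
```

With this shape, allocations made while the lock is held need no flag of
their own, and the reason for the restriction is documented by where the
scope begins.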

The first one I got a bug report from was generic/270.  We take
dqp->q_qlock in fs reclaim, and there are code paths which take
dqp->q_qlock, then allocate memory.  There's a rather nasty extra
step where we take the dqp->q_qlock, then wait on a workqueue which is
going to call xlog_cil_push_work() which does the memory allocation.
Lockdep spots this transitive dependency, but we have no mechanism to
transfer the nofs setting from the caller to the workqueue.

OK, fine, just add the memalloc_nofs_save() at the beginning of
xlog_cil_push_work() and restore it at the three exits; problem solved
in a moderately hacky way; but it's better than before since the KM_NOFS
calls are now safe to remove from the CIL machinery.  There are two ways
to solve this properly; one is to transfer the nofs setting from caller
to workqueue, and also set the nofs setting whenever we take the dqlock.
The other would be to trylock (and back out properly if held) if we're
called in the reclaim path.  I don't know how hard that would be.
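
The first option can be sketched in userspace as a work item that records
the submitter's nofs state at queue time and has the worker re-apply it
around the callback. This is purely a model: the workqueue API has no such
field as far as this message describes, and struct model_work,
model_queue_work() and the other model_* names are invented here.

```c
/* Userspace model only; not the kernel workqueue API. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_PF_MEMALLOC_NOFS  0x1u

/* Two toy tasks: the submitter and the worker that runs the work item. */
struct model_task { unsigned int flags; };
static struct model_task submitter_task, worker_task;
static struct model_task *current_task = &submitter_task;

/* Hypothetical work item that remembers the submitter's nofs state. */
struct model_work {
	bool inherit_nofs;                    /* captured at queue time */
	void (*fn)(struct model_work *work);
	bool saw_nofs;                        /* recorded by fn for inspection */
};

/* Capture the caller's allocation context when the work is queued. */
static void model_queue_work(struct model_work *work,
			     void (*fn)(struct model_work *work))
{
	work->inherit_nofs = (current_task->flags & MODEL_PF_MEMALLOC_NOFS) != 0;
	work->fn = fn;
	work->saw_nofs = false;
}

/* Worker side: apply the captured state around the callback, then undo it. */
static void model_run_work(struct model_work *work)
{
	struct model_task *prev = current_task;
	unsigned int old;

	assert(work->fn);
	current_task = &worker_task;          /* "switch" to the worker task */
	old = current_task->flags & MODEL_PF_MEMALLOC_NOFS;
	if (work->inherit_nofs)
		current_task->flags |= MODEL_PF_MEMALLOC_NOFS;

	work->fn(work);

	current_task->flags = (current_task->flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
	current_task = prev;
}

/* Stand-in for the push work: just records which context it ran in. */
static void model_push_work(struct model_work *work)
{
	work->saw_nofs = (current_task->flags & MODEL_PF_MEMALLOC_NOFS) != 0;
}

/* Returns 1 if the callback ran in a nofs scope, 0 otherwise. */
static int model_demo(bool submitter_in_nofs)
{
	struct model_work work;

	submitter_task.flags = submitter_in_nofs ? MODEL_PF_MEMALLOC_NOFS : 0;
	model_queue_work(&work, model_push_work);
	model_run_work(&work);
	submitter_task.flags = 0;
	return work.saw_nofs;
}
```

The worker's own flags are untouched once the callback returns, so one work
item's inherited scope cannot leak into the next. The model says nothing
about the second option (trylock in the reclaim path).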

Skipping the second and third ones, the fourth one I've encountered
looks like this: xfs_buf_get_map() is allocating memory.  Call path:

kmem_alloc+0x6f/0x170
xfs_buf_get_map+0x761/0x1140
xfs_buf_read_map+0x38/0x250
xfs_trans_read_buf_map+0x19c/0x520
xfs_btree_read_buf_block.constprop.0+0x7a/0xb0
xfs_btree_lookup_get_block+0x82/0x140
xfs_btree_lookup+0xaf/0x490
xfs_refcount_lookup_le+0x6a/0xd0
xfs_refcount_find_shared+0x6c/0x420
xfs_reflink_find_shared+0x67/0xa0
xfs_reflink_trim_around_shared+0xd7/0x1a0
xfs_bmap_trim_cow+0x3a/0x40
xfs_buffered_write_iomap_begin+0x2ce/0xbf0

That potentially deadlocks against

-> #0 (&xfs_nondir_ilock_class#3){++++}-{3:3}:
       __lock_acquire+0x148e/0x26d0
       lock_acquire+0xb8/0x280
       down_write_nested+0x3f/0xe0
       xfs_ilock+0xe3/0x260
       xfs_icwalk_ag+0x68c/0xa50
       xfs_icwalk+0x3e/0xa0
       xfs_reclaim_inodes_nr+0x7c/0xa0
       xfs_fs_free_cached_objects+0x14/0x20
       super_cache_scan+0x17d/0x1c0
       do_shrink_slab+0x16a/0x680
       shrink_slab+0x52a/0x8a0
       shrink_node+0x308/0x7a0
       balance_pgdat+0x28d/0x710

Annoyingly, lockdep doesn't tell me which acquisition of
xfs_nondir_ilock_class#3 the first backtrace did.

We could pop the nofs setting anywhere in this call chain, but _really_
what we should be doing is calling memalloc_nofs_save() when we take
the xfs_nondir_ilock_class#3.  But ... there are a lot of places we
take the ilock, and it's kind of a big deal to add memalloc_nofs_save()
calls to all of them.  And then I looked at _why_ we take the lock, and
it's kind of stupid; we're just waiting for other callers to free it.
ie xfs_reclaim_inode() does:

        if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
                goto out;
...
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
...
        if (!radix_tree_delete(&pag->pag_ici_root,
                                XFS_INO_TO_AGINO(ip->i_mount, ino)))
...
        xfs_ilock(ip, XFS_ILOCK_EXCL);

ie we did the trylock, and it succeeded.  We know we don't have the
lock in process context.  It feels like we could legitimately use
xfs_lock_inumorder() to use a different locking class to do this wait.

But all of this has shown me how complex this project is.  I have
no desire to send patches for review that are "obviously wrong" (to
someone with more extensive knowledge of XFS) and just suck up reviewer
bandwidth for a cleanup that is, perhaps, of limited value.  If someone
more junior wants to take this on as a project to learn more about XFS,
I'll happily help where I can, but I think my time is perhaps better
spent on other projects for now.