Re: [PATCH] nfsd: don't use GFP_KERNEL from nfsd_getxattr()/nfsd_listxattr()

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 17 Apr 2023 09:37:58 +1000

On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> > On 2023/04/16 3:40, Jeff Layton wrote:
> > > On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> > > > On 2023/04/16 1:13, Chuck Lever III wrote:
> > > > > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > > > > does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> > > > > 
> > > > > The server side threads run in process context. GFP_KERNEL
> > > > > is safe to use here -- as Jeff said, this code is not in
> > > > > the server's reclaim path. Plenty of other call sites in
> > > > > the NFS server code use GFP_KERNEL.
> > > > 
> > > > GFP_KERNEL memory allocation calls filesystem's shrinker functions
> > > > because of __GFP_FS flag. My understanding is
> > > > 
> > > >   Whether this code is in memory reclaim path or not is irrelevant.
> > > >   Whether memory reclaim path might hold lock or not is relevant.
> > > > 
> > > > . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> > > > 
> > > 
> > > No. At the time of these allocations, the i_rwsem is not held.
> > 
> > Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> > via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> > That's why
> > 
> > 	/*
> > 	 * We're holding i_rwsem - use GFP_NOFS.
> > 	 */
> > 
> > is explicitly there in nfsd_listxattr() side.

You can do GFP_KERNEL allocations holding the i_rwsem just fine.
All that it requires is the caller holds a reference to the inode,
and at that point inode will should skip the given inode without
every locking it.

Of course, lockdep can't handle the "referenced inode lock ->
fsreclaim -> unreferenced inode lock" pattern at all. It throws out
false positives when it detects this because it's not aware of the
fact that reference counts prevent inode lock recursion based
deadlocks in the vfs inode cache shrinker.

If a custom, non-vfs shrinker is walking inodes that have no
references and taking i_rwsem in a way that can block without first
checking whether it is safe to lock the inode in a deadlock free
manner, they are doing the wrong thing and the custom shrinker needs
to be fixed.

> > 
> > If memory reclaim path (directly or indirectly via locking dependency) involves
> > inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
> > 
> 
> (cc'ing Frank V. who wrote this code and -fsdevel)
> 
> I stand corrected! You're absolutely right that it's taking the i_rwsem
> for read there. That seems pretty weird, actually. I don't believe we
> need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
> certainly nothing else under there requires it.
> 
> Frank, was there some reason you decided you needed the inode_lock
> there? It looks like under the hood, the xattr code requires you to take
> it for write in setxattr and removexattr, but you don't need it at all
> in getxattr or listxattr. Go figure.

IIRC, the filesytsem can't take the i_rwsem for get/listxattr
because the lookup contexts may already hold the i_rwsem. I think
this is largely a problem caused by LSMs (e.g. IMA) needing to
access security xattrs in paths where the inode is already
locked.

> Longer term, I wonder what the inode_lock is protecting in setxattr and
> removexattr operations, given that get and list don't require them?
> These are always delegated to the filesystem driver -- there is no
> generic xattr implementation.

Serialising updates against each other. xattr modifications are
supposed to be "atomic" from the perspective of the user - they see
the entire old or the new xattr, never a partial value.

FWIW, XFS uses it's internal metadata rwsem for access/update
serialisation of xattrs because we can't rely on the i_rwsem for
reliable serialisation. I'm guessing that most journalling
filesystems do something similar.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx