Re: [PATCH] nfsd: don't use GFP_KERNEL from nfsd_getxattr()/nfsd_listxattr()

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 19 Apr 2023 13:51:12 +0000

> On Apr 16, 2023, at 7:37 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
>> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
>>> On 2023/04/16 3:40, Jeff Layton wrote:
>>>> On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
>>>>> On 2023/04/16 1:13, Chuck Lever III wrote:
>>>>>>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
>>>>>>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
>>>>>> 
>>>>>> The server side threads run in process context. GFP_KERNEL
>>>>>> is safe to use here -- as Jeff said, this code is not in
>>>>>> the server's reclaim path. Plenty of other call sites in
>>>>>> the NFS server code use GFP_KERNEL.
>>>>> 
>>>>> GFP_KERNEL memory allocation calls filesystem's shrinker functions
>>>>> because of __GFP_FS flag. My understanding is
>>>>> 
>>>>>  Whether this code is in memory reclaim path or not is irrelevant.
>>>>>  Whether memory reclaim path might hold lock or not is relevant.
>>>>> 
>>>>> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
>>>>> 
>>>> 
>>>> No. At the time of these allocations, the i_rwsem is not held.
>>> 
>>> Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
>>> via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
>>> That's why
>>> 
>>> /*
>>>  * We're holding i_rwsem - use GFP_NOFS.
>>>  */
>>> 
>>> is explicitly there in nfsd_listxattr() side.
> 
> You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> All that it requires is the caller holds a reference to the inode,
> and at that point inode will should skip the given inode without
> every locking it.

This suggests that the fix is to replace "GFP_KERNEL | GFP_NOFS"
with "GFP_KERNEL" /and/ ensure those paths are holding an
appropriate inode reference.

I'd rather not weaken the retry semantics of the memory
allocation if we do not have to do so.

> Of course, lockdep can't handle the "referenced inode lock ->
> fsreclaim -> unreferenced inode lock" pattern at all. It throws out
> false positives when it detects this because it's not aware of the
> fact that reference counts prevent inode lock recursion based
> deadlocks in the vfs inode cache shrinker.
> 
> If a custom, non-vfs shrinker is walking inodes that have no
> references and taking i_rwsem in a way that can block without first
> checking whether it is safe to lock the inode in a deadlock free
> manner, they are doing the wrong thing and the custom shrinker needs
> to be fixed.
> 
>>> 
>>> If memory reclaim path (directly or indirectly via locking dependency) involves
>>> inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
>>> 
>> 
>> (cc'ing Frank V. who wrote this code and -fsdevel)
>> 
>> I stand corrected! You're absolutely right that it's taking the i_rwsem
>> for read there. That seems pretty weird, actually. I don't believe we
>> need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
>> certainly nothing else under there requires it.
>> 
>> Frank, was there some reason you decided you needed the inode_lock
>> there? It looks like under the hood, the xattr code requires you to take
>> it for write in setxattr and removexattr, but you don't need it at all
>> in getxattr or listxattr. Go figure.
> 
> IIRC, the filesytsem can't take the i_rwsem for get/listxattr
> because the lookup contexts may already hold the i_rwsem. I think
> this is largely a problem caused by LSMs (e.g. IMA) needing to
> access security xattrs in paths where the inode is already
> locked.
> 
>> Longer term, I wonder what the inode_lock is protecting in setxattr and
>> removexattr operations, given that get and list don't require them?
>> These are always delegated to the filesystem driver -- there is no
>> generic xattr implementation.
> 
> Serialising updates against each other. xattr modifications are
> supposed to be "atomic" from the perspective of the user - they see
> the entire old or the new xattr, never a partial value.
> 
> FWIW, XFS uses it's internal metadata rwsem for access/update
> serialisation of xattrs because we can't rely on the i_rwsem for
> reliable serialisation. I'm guessing that most journalling
> filesystems do something similar.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

--
Chuck Lever