> On Apr 16, 2023, at 7:37 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote: >> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote: >>> On 2023/04/16 3:40, Jeff Layton wrote: >>>> On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote: >>>>> On 2023/04/16 1:13, Chuck Lever III wrote: >>>>>>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote: >>>>>>> >>>>>>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS >>>>>>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock. >>>>>> >>>>>> The server side threads run in process context. GFP_KERNEL >>>>>> is safe to use here -- as Jeff said, this code is not in >>>>>> the server's reclaim path. Plenty of other call sites in >>>>>> the NFS server code use GFP_KERNEL. >>>>> >>>>> GFP_KERNEL memory allocation calls filesystem's shrinker functions >>>>> because of __GFP_FS flag. My understanding is >>>>> >>>>> Whether this code is in memory reclaim path or not is irrelevant. >>>>> Whether memory reclaim path might hold lock or not is relevant. >>>>> >>>>> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path? >>>>> >>>> >>>> No. At the time of these allocations, the i_rwsem is not held. >>> >>> Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem >>> via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation. >>> That's why >>> >>> /* >>> * We're holding i_rwsem - use GFP_NOFS. >>> */ >>> >>> is explicitly there in nfsd_listxattr() side. > > You can do GFP_KERNEL allocations holding the i_rwsem just fine. > All that it requires is the caller holds a reference to the inode, > and at that point inode will should skip the given inode without > every locking it. This suggests that the fix is to replace "GFP_KERNEL | GFP_NOFS" with "GFP_KERNEL" /and/ ensure those paths are holding an appropriate inode reference. I'd rather not weaken the retry semantics of the memory allocation if we do not have to do so. > Of course, lockdep can't handle the "referenced inode lock -> > fsreclaim -> unreferenced inode lock" pattern at all. It throws out > false positives when it detects this because it's not aware of the > fact that reference counts prevent inode lock recursion based > deadlocks in the vfs inode cache shrinker. > > If a custom, non-vfs shrinker is walking inodes that have no > references and taking i_rwsem in a way that can block without first > checking whether it is safe to lock the inode in a deadlock free > manner, they are doing the wrong thing and the custom shrinker needs > to be fixed. > >>> >>> If memory reclaim path (directly or indirectly via locking dependency) involves >>> inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag. >>> >> >> (cc'ing Frank V. who wrote this code and -fsdevel) >> >> I stand corrected! You're absolutely right that it's taking the i_rwsem >> for read there. That seems pretty weird, actually. I don't believe we >> need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and >> certainly nothing else under there requires it. >> >> Frank, was there some reason you decided you needed the inode_lock >> there? It looks like under the hood, the xattr code requires you to take >> it for write in setxattr and removexattr, but you don't need it at all >> in getxattr or listxattr. Go figure. > > IIRC, the filesytsem can't take the i_rwsem for get/listxattr > because the lookup contexts may already hold the i_rwsem. I think > this is largely a problem caused by LSMs (e.g. IMA) needing to > access security xattrs in paths where the inode is already > locked. > >> Longer term, I wonder what the inode_lock is protecting in setxattr and >> removexattr operations, given that get and list don't require them? >> These are always delegated to the filesystem driver -- there is no >> generic xattr implementation. > > Serialising updates against each other. xattr modifications are > supposed to be "atomic" from the perspective of the user - they see > the entire old or the new xattr, never a partial value. > > FWIW, XFS uses it's internal metadata rwsem for access/update > serialisation of xattrs because we can't rely on the i_rwsem for > reliable serialisation. I'm guessing that most journalling > filesystems do something similar. > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx -- Chuck Lever