On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote: > Hi Viro/All, > > Could you please review below issue and it's proposed solutions. > If you could let me know which of the two you think will be a better > approach to solve this or in case if you have any other better approach, I > can prepare and submit a official patch with that. > > > > Issue signature:- > [NIP : trailing_symlink+80] > [LR : trailing_symlink+1092] > #4 [c00000198069bb70] trailing_symlink at c0000000004bae60 (unreliable) > #5 [c00000198069bc00] path_openat at c0000000004bdd14 > #6 [c00000198069bc90] do_filp_open at c0000000004c0274 > #7 [c00000198069bdb0] do_sys_open at c00000000049b248 > #8 [c00000198069be30] system_call at c00000000000b388 > > > > Test case:- > shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done" > shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1 > /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done > > > > Problem description:- > In some filesystems like GPFS below described scenario may happen on some > platforms (Reported-By:- wugyuan) > > Here, two threads are being run in 2 different shells. Thread-1(cat) does > cat of the symlink and Thread-2(ln) is creating the symlink. > > Now on any platform with GPFS like filesystem, if CPU does out-of-order > execution (or any kind of re-ordering due compiler optimization?) in > function __d_set_and_inode_type(), then we see a NULL pointer dereference > due to inode->i_uid. > > This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in > else condition), we check d_is_negative() without any lock protection. > And since in __d_set_and_inode_type() re-ordering may happen in setting of > dentry->type & dentry->inode => this means that there is this tiny window > where things are going wrong. > > > (GPFS like):- Any FS with -inode_operations ->permission callback returning > -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen. > (few e.g. found were - ocfs2, ceph, coda, afs) > > int xxx_permission(struct inode *inode, int mask) > { > if (mask & MAY_NOT_BLOCK) > return -ECHILD; > <...> > } > > Wugyuan(cc), could reproduce this problem with GPFS filesystem. > Since, I didn't have the GPFS setup, so I tried replicating on a native FS > by forcing out-of-order execution in function __d_set_inode_and_type() and > making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission > operation for all inodes. > > With above changes in kernel, I could as well hit this issue on a native FS > too. > > (basically what we observed is link_path_walk will do nonRCU(REF-walk) > lookup due to may_lookup -> inode_permission return -ECHILD and then > unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is > possible). > > > > Sequence of events:- > > Thread-2(Comm: ln) Thread-1(Comm: cat) > > dentry = __d_lookup() //nonRCU > > __d_set_and_inode_type() (Out-of-order execution) > flags = READ_ONCE(dentry->d_flags); > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); > flags |= type_flags; > WRITE_ONCE(dentry->d_flags, flags); > > > if (unlikely(d_is_negative()) // fails > {} > // since type is already updated in > // Thread-2 in parallel but inode > // not yet set. > // d_is_negative returns false > > *inode = d_backing_inode(path->dentry); > // means inode is still NULL > > dentry->d_inode = inode; > > trailing_symlink() > may_follow_link() > inode = nd->link_inode; > // nd->link_inode = NULL > //Then it crashes while > //doing inode->i_uid > > It seems much similar to https://lore.kernel.org/r/20190419084810.63732-1-houtao1@xxxxxxxxxx/ Thanks, Gao Xiang > > > > Approach-1:- using wmb() > > diff --git a/fs/dcache.c b/fs/dcache.c > index e88cf0554e65..966172df77ee 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -316,6 +316,7 @@ static inline void __d_set_inode_and_type(struct dentry > *dentry, > unsigned flags; > > dentry->d_inode = inode; > + smp_wmb(); > flags = READ_ONCE(dentry->d_flags); > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); > flags |= type_flags; > > > > Approach-2:- using spin_lock(&dentry->d_lock); > > Do you think lock should be a better approach, given that we are already > in REF-walk mode. As per the Documentation, we should be able to take > spin_lock(&dentry->d_lock) in Ref-walk mode whenever required? > > > With smp_wmb(), if added, should add a small latency in both > RCU/REF-walk. But should be able to cover all the cases in general related > to dentry->type & dentry->inode ordering. Though, we don't have any other > race conditions reported or tested, other than the one, mentioned in this > email. > > Confused :( > > > > diff --git a/fs/namei.c b/fs/namei.c > index 209c51a5226c..a3145594da1c 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -1557,6 +1557,7 @@ static int lookup_fast(struct nameidata *nd, > struct dentry *dentry, *parent = nd->path.dentry; > int status = 1; > int err; > + bool negative; > > /* > * Rename seqlock is not required here because in the off chance > @@ -1565,7 +1566,6 @@ static int lookup_fast(struct nameidata *nd, > */ > if (nd->flags & LOOKUP_RCU) { > unsigned seq; > - bool negative; > dentry = __d_lookup_rcu(parent, &nd->last, &seq); > if (unlikely(!dentry)) { > if (unlazy_walk(nd)) > @@ -1623,7 +1623,11 @@ static int lookup_fast(struct nameidata *nd, > dput(dentry); > return status; > } > - if (unlikely(d_is_negative(dentry))) { > + > + spin_lock(&dentry->d_lock); > + negative = d_is_negative(dentry); > + spin_unlock(&dentry->d_lock); > + if (unlikely(negative)) { > dput(dentry); > return -ENOENT; > } > > > Regards > Ritesh >