On 9/3/19 6:29 PM, Gao Xiang wrote:
On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
Hi Viro/All,
Could you please review below issue and it's proposed solutions.
If you could let me know which of the two you think will be a better
approach to solve this or in case if you have any other better approach, I
can prepare and submit a official patch with that.
Issue signature:-
[NIP : trailing_symlink+80]
[LR : trailing_symlink+1092]
#4 [c00000198069bb70] trailing_symlink at c0000000004bae60 (unreliable)
#5 [c00000198069bc00] path_openat at c0000000004bdd14
#6 [c00000198069bc90] do_filp_open at c0000000004c0274
#7 [c00000198069bdb0] do_sys_open at c00000000049b248
#8 [c00000198069be30] system_call at c00000000000b388
Test case:-
shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
/gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
Problem description:-
In some filesystems like GPFS below described scenario may happen on some
platforms (Reported-By:- wugyuan)
Here, two threads are being run in 2 different shells. Thread-1(cat) does
cat of the symlink and Thread-2(ln) is creating the symlink.
Now on any platform with GPFS like filesystem, if CPU does out-of-order
execution (or any kind of re-ordering due compiler optimization?) in
function __d_set_and_inode_type(), then we see a NULL pointer dereference
due to inode->i_uid.
This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
else condition), we check d_is_negative() without any lock protection.
And since in __d_set_and_inode_type() re-ordering may happen in setting of
dentry->type & dentry->inode => this means that there is this tiny window
where things are going wrong.
(GPFS like):- Any FS with -inode_operations ->permission callback returning
-ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
(few e.g. found were - ocfs2, ceph, coda, afs)
int xxx_permission(struct inode *inode, int mask)
{
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
<...>
}
Wugyuan(cc), could reproduce this problem with GPFS filesystem.
Since, I didn't have the GPFS setup, so I tried replicating on a native FS
by forcing out-of-order execution in function __d_set_inode_and_type() and
making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
operation for all inodes.
With above changes in kernel, I could as well hit this issue on a native FS
too.
(basically what we observed is link_path_walk will do nonRCU(REF-walk)
lookup due to may_lookup -> inode_permission return -ECHILD and then
unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
possible).
Sequence of events:-
Thread-2(Comm: ln) Thread-1(Comm: cat)
dentry = __d_lookup() //nonRCU
__d_set_and_inode_type() (Out-of-order execution)
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
WRITE_ONCE(dentry->d_flags, flags);
if (unlikely(d_is_negative()) // fails
{}
// since type is already updated in
// Thread-2 in parallel but inode
// not yet set.
// d_is_negative returns false
*inode = d_backing_inode(path->dentry);
// means inode is still NULL
dentry->d_inode = inode;
trailing_symlink()
may_follow_link()
inode = nd->link_inode;
// nd->link_inode = NULL
//Then it crashes while
//doing inode->i_uid
It seems much similar to
https://lore.kernel.org/r/20190419084810.63732-1-houtao1@xxxxxxxxxx/
Thanks, yes two same symptoms with different use cases.
But except the fact that here, we see the issue with GPFS quite
frequently. So let's hope that we could have some solution to this
problem in upstream.
From the thread:-
>> We could simply use d_really_is_negative() there, avoiding all that
>> mess. If and when we get around to whiteouts-in-dcache (i.e. if
>> unionfs series gets resurrected), we can revisit that
I didn't get this part. Does it mean, d_really_is_negative can only be
used, once whiteouts-in-dcache series is resurrected?
If yes, meanwhile could we have any other solution in place?
-ritesh