On Fri, Nov 12, 2021 at 07:10:19AM +0800, Ian Kent wrote: > On Thu, 2021-11-11 at 11:08 -0500, Brian Foster wrote: > > Hi Brian, > > > On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote: > > > When following a trailing symlink in rcu-walk mode it's possible to > > > succeed in getting the ->get_link() method pointer but the link > > > path > > > string be deallocated while it's being used. > > > > > > Utilize the rcu mechanism to mitigate this risk. > > > > > > Suggested-by: Miklos Szeredi <miklos@xxxxxxxxxx> > > > Signed-off-by: Ian Kent <raven@xxxxxxxxxx> > > > --- > > > fs/xfs/kmem.h | 4 ++++ > > > fs/xfs/xfs_inode.c | 4 ++-- > > > fs/xfs/xfs_iops.c | 10 ++++++++-- > > > 3 files changed, 14 insertions(+), 4 deletions(-) > > > > > ... > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > > > index a607d6aca5c4..2977e19da7b7 100644 > > > --- a/fs/xfs/xfs_iops.c > > > +++ b/fs/xfs/xfs_iops.c > > > @@ -524,11 +524,17 @@ xfs_vn_get_link_inline( > > > > > > /* > > > * The VFS crashes on a NULL pointer, so return - > > > EFSCORRUPTED if > > > - * if_data is junk. > > > + * if_data is junk. Also, if the path walk is in rcu-walk > > > mode > > > + * and the inode link path has gone away due inode re-use > > > we have > > > + * no choice but to tell the VFS to redo the lookup. > > > */ > > > - link = ip->i_df.if_u1.if_data; > > > + link = rcu_dereference(ip->i_df.if_u1.if_data); > > > + if (!dentry && !link) > > > + return ERR_PTR(-ECHILD); > > > + > > > > One thing that concerns me slightly about this approach is that inode > > reuse does not necessarily guarantee that if_data is NULL. It seems > > technically just as possible (even if exceedingly unlikely) for link > > to > > point at newly allocated memory since the previous sequence count > > validation check. The inode could be reused as another inline symlink > > for example, though it's not clear to me if that is really a problem > > for > > the vfs (assuming a restart would just land on the new link > > anyways?). > > But the inode could also be reallocated as something like a shortform > > directory, which means passing directory header data or whatever that > > it > > stores in if_data back to pick_link(), which is then further > > processed > > as a string. > > This is the sort of feedback I was hoping for. > > This sounds related to the life-cycle of xfs inodes and re-use. > Hopefully someone here on the list can enlighten me on this. > > The thing that comes to mind is that the inode re-use would > need to occur between the VFS check that validates the inode > is still ok and the use of link string. I think that can still > go away even with the above check. > Yeah... The original NULL ->get_link() problem was replicated with a small delay in the lookup path (specifically in the symlink processing path). This essentially widens the race window and allows a separate task to invalidate the dentry between the time the last dentry sequence validation occurred (and passed) and the attempt to call ->get_link() becomes imminent. I think patch 1 largely addresses this issue because we'll have revalidated the previous read of the function pointer before we attempt to call it. That leads to this patch, which suggests that even after the validation fix a small race window still technically exists with the ->get_link() code and inode teardown. In fact, it's not that hard to show that this is true by modifying the original reproducer to push the delay out beyond the check added by patch 1 (or into the ->get_link() callback). Playing around with that a bit, it's possible to induce a ->get_link() call to an inode that was reallocated as a shortform directory and returns a non-NULL if_data fork of that dir back to the vfs (to be interpreted as a symlink string). Nothing seems to explode on my quick test, fortunately, but I don't think that's an interaction we want to maintain. Of course one caveat to all of that is that after patch 1, the race window for that one might be so small as to make this impossible to reproduce in practice (whereas the problem fixed by patch 1 has been reproduced by users)... > Hopefully someone can clarify what happens here. > > > > > With that, I wonder why we wouldn't just return -ECHILD here like we > > do > > for the non-inline case to address the immediate problem, and then > > perhaps separately consider if we can rework bits of the > > reuse/reclaim > > code to allow rcu lookup of inline symlinks under certain conditions. > > Always switching to ref-walk mode would certainly resolve the > problem too, yes, perhaps we have no choice ... > Oh I don't think it's the only choice. I think Miklos' suggestion to use ->free_inode() is probably the right general approach. I just think a switch to ref-walk mode might be a good incremental step to fix this problem in a backportable way (s_op->free_inode() is newer relative to the introduction of _get_link_inline()). We can always re-enable rcu symlink processing once we get our inode teardown/reuse bits fixed up accordingly.. Just my .02. Brian > Ian > > > > FWIW, I'm also a little curious why we don't set i_link for inline > > symlinks. I don't think that addresses this validation problem, but > > perhaps might allow rcu lookups in the inline symlink common case > > where > > things don't change during the lookup (and maybe even eliminate the > > need > > for this custom inline callback)..? > > > > Brian > > > > > if (XFS_IS_CORRUPT(ip->i_mount, !link)) > > > return ERR_PTR(-EFSCORRUPTED); > > > + > > > return link; > > > } > > > > > > > > > > > > >