On Fri, Nov 12, 2021 at 08:23:24AM +0100, Miklos Szeredi wrote: > On Fri, 12 Nov 2021 at 01:32, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote: > > > When following a trailing symlink in rcu-walk mode it's possible to > > > succeed in getting the ->get_link() method pointer but the link path > > > string be deallocated while it's being used. > > > > > > Utilize the rcu mechanism to mitigate this risk. > > > > > > Suggested-by: Miklos Szeredi <miklos@xxxxxxxxxx> > > > Signed-off-by: Ian Kent <raven@xxxxxxxxxx> > > > --- > > > fs/xfs/kmem.h | 4 ++++ > > > fs/xfs/xfs_inode.c | 4 ++-- > > > fs/xfs/xfs_iops.c | 10 ++++++++-- > > > 3 files changed, 14 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h > > > index 54da6d717a06..c1bd1103b340 100644 > > > --- a/fs/xfs/kmem.h > > > +++ b/fs/xfs/kmem.h > > > @@ -61,6 +61,10 @@ static inline void kmem_free(const void *ptr) > > > { > > > kvfree(ptr); > > > } > > > +static inline void kmem_free_rcu(const void *ptr) > > > +{ > > > + kvfree_rcu(ptr); > > > +} > > > > > > > > > static inline void * > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > > index a4f6f034fb81..aaa1911e61ed 100644 > > > --- a/fs/xfs/xfs_inode.c > > > +++ b/fs/xfs/xfs_inode.c > > > @@ -2650,8 +2650,8 @@ xfs_ifree( > > > * already been freed by xfs_attr_inactive. > > > */ > > > if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) { > > > - kmem_free(ip->i_df.if_u1.if_data); > > > - ip->i_df.if_u1.if_data = NULL; > > > + kmem_free_rcu(ip->i_df.if_u1.if_data); > > > + RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL); > > > ip->i_df.if_bytes = 0; > > > } > > > > How do we get here in a way that the VFS will walk into this inode > > during a lookup? > > > > I mean, the dentry has to be validated and held during the RCU path > > walk, so if we are running a transaction to mark the inode as free > > here it has already been unlinked and the dentry turned > > negative. So anything that is doing a lockless pathwalk onto that > > dentry *should* see that it is a negative dentry at this point and > > hence nothing should be walking any further or trying to access the > > link that was shared from ->get_link(). > > > > AFAICT, that's what the sequence check bug you fixed in the previous > > patch guarantees. It makes no difference if the unlinked inode has > > been recycled or not, the lookup race condition is the same in that > > the inode has gone through ->destroy_inode and is now owned by the > > filesystem and not the VFS. > > Yes, the concern here is that without locking all the above can > theoretically happen between the sequence number check and if_data > being dereferenced. It would be good to describe the race condition in the commit message, because it's not at all obvious to readers how this race condition is triggered. > > Otherwise, it might just be best to memset the buffer to zero here > > rather than free it, and leave it to be freed when the inode is > > freed from the RCU callback in xfs_inode_free_callback() as per > > normal. Just as a FYI ext4_evict_inode() does exactly this with it's inline symlink data buffer it passes to ->get_link() when it is freeing the inode when the last reference to an unlinked inode goes away: /* * Set inode->i_size to 0 before calling ext4_truncate(). We need * special handling of symlinks here because i_size is used to * determine whether ext4_inode_info->i_data contains symlink data or * block mappings. Setting i_size to 0 will remove its fast symlink * status. Erase i_data so that it becomes a valid empty block map. */ if (ext4_inode_is_fast_symlink(inode)) memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data)); inode->i_size = 0; IOWs, if the pointer returned from ->get_link() on an ext4 filesystem is accessed by the VFS after ->evict() is called, then it sees an empty buffer. IOWs, it's just not safe for the VFS to access the inode's link buffer pointer the moment the last reference to an unlinked inode goes away because it's contents are no longer guaranteed to be valid. I note that ext4 then immediately sets the VFS inode size to 0 length, indicating the link is no longer valid. It may well be that XFS is not setting the VFS inode size to zero in this case (I don't think the generic evict() path does that) so perhaps that's a guard that could be used at the VFS level to avoid the race condition... > My suggestion was to use .free_inode instead of .destroy_inode, the > former always being called after an RCU grace period. Which doesn't address the ext4 "zero the buffer in .evict", either. And for XFS, it means we then likely need to add synchronise_rcu() calls wherever we need to wait for inodes to be inactivated and/or reclaimed because they won't even get queued for inactivation until the current grace period expires. That's a potential locking nightmare because inode inactivation and reclaim uses rcu protected lockless algorithms to access XFS inodes without taking reference counts.... I just can't see how this race condition is XFS specific and why fixing it requires XFS to sepcifically handle it while we ignore similar theoretical issues in other filesystems... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx