Patchset summary follows; patches themselves - in followups.

1/15) rcu pathwalk: prevent bogus hard errors from may_lookup()
	That's one in the core pathwalk logic; basically, treat errors from ->permission() calls in RCU mode as "trust hard errors only if the object is still there". Mostly for the sake of ->permission() instances (similar to ->d_compare() et al., the only thing they need to care about when called on an inode that is getting torn down is not to crash; the specific error value doesn't matter), but there's also a narrow race where RCU pathwalk yields EACCES while refwalk would fail with ENOTDIR.

2/15) exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
	Several UAFs in their ->d_hash/->d_compare.

3/15) affs: free affs_sb_info with kfree_rcu()
	UAF in ->d_hash/->d_compare.

4/15) hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
	UAF and oops in ->d_hash/->d_compare.

5/15) cifs_get_link(): bail out in unsafe case
	GFP_KERNEL allocation under rcu_read_lock in ->get_link().

6/15) procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()

7/15) procfs: make freeing proc_fs_info rcu-delayed
	UAF in ->d_revalidate/->permission.

8/15) gfs2: fix an oops in gfs2_permission()
	What it says...

9/15) nfs: make nfs_set_verifier() safe for use in RCU pathwalk
	Race and oops in nfs_set_verifier() when used in RCU mode. Short version of the story: holding ->d_lock does *not* guarantee that ->d_parent->d_inode is non-NULL, unless you have your dentry pinned.

10/15) nfs: fix UAF on pathwalk running into umount
	Several UAFs in ->d_revalidate/->permission/->get_link.

11/15) fuse: fix UAF in rcu pathwalks
	UAF in ->permission/->get_link; they access ->s_fs_info->fc, and ->s_fs_info points to struct fuse_mount, which gets freed synchronously by fuse_mount_destroy() from the end of ->kill_sb().
	The code proceeds to access ->fc, but that part is safe - ->fc freeing is done via kfree_rcu() (called via ->fc->release()) after the refcount of ->fc drops to zero. That can't happen until the call of fuse_conn_put() (from fuse_mount_destroy() from ->kill_sb()), so anything rcu-delayed from there won't get freed until the end of the RCU pathwalk. Unfortunately, we also dereference fc->user_ns (we pass it to current_in_userns()). That gets dropped via put_user_ns() (not rcu-delayed) from the final fuse_conn_put(), and it needs to be delayed. Possible solution: make the freeing in ->release() synchronous and do call_rcu(&fc->rcu, delayed_release), with delayed_release() doing put_user_ns() and calling ->release(). This one is relatively intrusive and I'd really like comments from Miklos...

12/15) afs: fix __afs_break_callback() / afs_drop_open_mmap() race
	->d_revalidate/->permission manage to step on this one. We might end up doing __afs_break_callback() there, and it could race with afs_drop_open_mmap(), leading to stray queued work on an object that might be about to be freed, with nothing to flush or cancel the sucker. To fix that race, make sure that afs_drop_open_mmap() will only do the final decrement while holding ->cb_lock; since the entirety of __afs_break_callback() is done under the same lock, it will either see a zero mmap count and do nothing, or it will finish queue_work() before afs_drop_open_mmap() gets to its flush_work().

13/15) overlayfs: move freeing ovl_entry past rcu delay
	Used to be done by kfree_rcu() from ->d_release(); got moved to ->destroy_inode() without any delays. Move the freeing to ->free_inode().

14/15) ovl_dentry_revalidate_common(): fetch inode once
	d_inode_rcu() is right - we might be in rcu pathwalk; however, OVL_E() hides a plain d_inode() on the same dentry...

15/15) overlayfs: make use of ->layers safe in rcu pathwalk
	ovl_permission() accesses ->layers[...].mnt; we can't have ->layers freed without an RCU delay on fs shutdown.
Fortunately, kern_unmount_array(), used to drop those mounts, does include an RCU delay, so the freeing is delayed; unfortunately, the array passed to kern_unmount_array() is formed by mangling the ->layers contents, and that happens without any delay. Use a separate array instead - local if we have only a few layers, kmalloc'ed if there are a lot of them. If the allocation fails, fall back to kern_unmount() on the individual mounts; it's not a fast path by any stretch of the imagination.