Re: [PATCH 3/3] ksmbd: fix racy issue from using ->d_parent and ->d_name

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Tue, 17 May 2022 21:21:14 +0000

On Wed, Apr 27, 2022 at 11:32:45AM +0900, Namjae Jeon wrote:
> Al pointed out that ksmbd has racy issue from using ->d_parent and ->d_name
> in ksmbd_vfs_unlink and smb2_vfs_rename(). and use new lock_rename_child()
> to lock stable parent while underlying rename racy.
> Introduce vfs_path_parent_lookup helper to avoid out of share access and
> export vfs functions like the following ones to use
> vfs_path_parent_lookup().
>  - export __lookup_hash().
>  - export getname_kernel() and putname().
> 
> vfs_path_parent_lookup() is used for parent lookup of destination file
> using absolute pathname given from FILE_RENAME_INFORMATION request.

First of all, this is seriously broken:

> -int ksmbd_vfs_lock_parent(struct user_namespace *user_ns, struct dentry *parent,
> -			  struct dentry *child)
> +struct dentry *ksmbd_vfs_lock_parent(struct dentry *child)
>  {
> -	struct dentry *dentry;
> -	int ret = 0;
> +	struct dentry *parent;
>  
> +	parent = dget(child->d_parent);
>  	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);

Do that in parallel with host rename() and you are risking this:

you: fetch child->d_parent
	get preempted away
another thread: move child elsewhere
another thread: drop (the last) reference to old parent
	memory pressure evicts that dentry and reuses memory
you: regain the timeslice and bump what you think is parent->d_count.
	In reality, it's 4 bytes in completely different data structure.
	At that point you already have a memory corruptor.  Worse,
	there's a decent chance that subsequent code will revert the
	corruption, so it would be hell to debug - you need a race
	to reproduce the thing in the first place *and* you need
	something else to notice the temporary memory corruption.

you: fetch what you think is ->d_inode of that dentry.  It actually
	isn't anything of that sort.
you: grab rwsem at hell knows what address (might or might not point
	to an rwsem).  Here's another chance to get something
	reproducible - e.g. if what you thought was ->d_inode actually
	points to unmapped memory, you'll get an oops here.  Won't
	be consistent, though.

> +	if (child->d_parent != parent) {
you:	->d_parent doesn't point there anymore

> +		dput(parent);

you:	decrement those 4 bytes in whatever object it is; if you are
	lucky, it won't hit zero and nobody had noticed the temporary
	increment.  If you are not, well...

> +		inode_unlock(d_inode(parent));

you:	fetch ->d_inode of parent (mind you, it's a bug in its own right -
	even if parent hadn't gotten freed before your dget(), after dput()
	above it's fair game for getting freed; placing that dput()
	before unlocking d_inode() is wrong).  Assuming you've got
	the same pointer as the first time around, you proceed to
	drop rwsem at the same address where you've grabbed it.

IOW, you really don't want that in the tree in this form.

It *might* be partially recoverable if you replace the first dget() with
dget_parent() and reorder dput() and inode_unlock() in failure case, but...
some of the callers of that thing are also rather dubious.

Look: you have smb2_open() calling ksmbd_vfs_may_delete(), which calls
that thing.  Downstream of this:
	if (!file_present) {
	...
	} else if (!already_permitted) {

If the parent is *NOT* already locked by that point, just how much is
your 'file_present' worth?  And if it is, you'd obviously deadlock
right there and then...

I'm not sure I like what you've done with added exports - e.g.
__lookup_hash had been OK as a name of static function, but exporting
it is asking for clashes.  And honestly, what would you say when running
into a name like that?  OK, it sounds like it's a (probably low-level)
lookup in some hash table.  _Maybe_ it would've been fine if we had one
and only implementation of hash tables in the entire tree and that
had been a part of it, but it's nothing of that sort.  And "hash" in
the name is not about doing a hash lookup as opposed to some other
work (it *does* handle hash misses, allocating dentry, asking filesystem
to do real on-disk lookup, etc.) - it's actually about "hash function
of the name is already calculated".  My fault, that - predecessor of
that thing had been called lookup_one(); it took a string, calculated
its length and computed hash, then proceeded to do lookups.  The latter
part could be reused in handling of rmdir et.al., where we already had
the component length and hash precalculated, so the tail of lookup_one()
had been carved out into a separate helper.  Circa 2.3.99...

Anyway, the name is _not_ fit for an export; I'm not sure what to call
it - lookup_one_qstr(), perhaps?  Additional fun is due to the fact
that these days it is slightly different from the lookup_one() et.al.
Those can be called with directory held shared; that allows parallel
lookups, but it's not free of cost - if we run into a cache miss and need
to allocate a new dentry and talk to filesystem, we have to recheck the
hash table after allocation.  __lookup_hash() is called only with parent
held exclusive and it can skip that fun - hash miss is going to remain
a miss; nobody else will be able to insert stuff into dcache in that
directory until we unlock it.

What I'm worried about is that renaming it to lookup_one_qstr() will
be an invitation for "oh, we happen to have hash/len already known by the
time of that lookup_one() call; let's just convert it to lookup_one_qsrt()"
and if that happens in a place where the parent is held only shared, we'll
be in trouble.  OTOH, lookup_one_qstr_excl() sounds like an invitation to
do something painful to whoever's responsible for such name...

Suggestions, anyone?