On Wed, Apr 27, 2022 at 11:32:45AM +0900, Namjae Jeon wrote: > Al pointed out that ksmbd has racy issue from using ->d_parent and ->d_name > in ksmbd_vfs_unlink and smb2_vfs_rename(). and use new lock_rename_child() > to lock stable parent while underlying rename racy. > Introduce vfs_path_parent_lookup helper to avoid out of share access and > export vfs functions like the following ones to use > vfs_path_parent_lookup(). > - export __lookup_hash(). > - export getname_kernel() and putname(). > > vfs_path_parent_lookup() is used for parent lookup of destination file > using absolute pathname given from FILE_RENAME_INFORMATION request. First of all, this is seriously broken: > -int ksmbd_vfs_lock_parent(struct user_namespace *user_ns, struct dentry *parent, > - struct dentry *child) > +struct dentry *ksmbd_vfs_lock_parent(struct dentry *child) > { > - struct dentry *dentry; > - int ret = 0; > + struct dentry *parent; > > + parent = dget(child->d_parent); > inode_lock_nested(d_inode(parent), I_MUTEX_PARENT); Do that in parallel with host rename() and you are risking this: you: fetch child->d_parent get preempted away another thread: move child elsewhere another thread: drop (the last) reference to old parent memory pressure evicts that dentry and reuses memory you: regain the timeslice and bump what you think is parent->d_count. In reality, it's 4 bytes in completely different data structure. At that point you already have a memory corruptor. Worse, there's a decent chance that subsequent code will revert the corruption, so it would be hell to debug - you need a race to reproduce the thing in the first place *and* you need something else to notice the temporary memory corruption. you: fetch what you think is ->d_inode of that dentry. It actually isn't anything of that sort. you: grab rwsem at hell knows what address (might or might not point to an rwsem). Here's another chance to get something reproducible - e.g. if what you thought was ->d_inode actually points to unmapped memory, you'll get an oops here. Won't be consistent, though. > + if (child->d_parent != parent) { you: ->d_parent doesn't point there anymore > + dput(parent); you: decrement those 4 bytes in whatever object it is; if you are lucky, it won't hit zero and nobody had noticed the temporary increment. If you are not, well... > + inode_unlock(d_inode(parent)); you: fetch ->d_inode of parent (mind you, it's a bug in its own right - even if parent hadn't gotten freed before your dget(), after dput() above it's fair game for getting freed; placing that dput() before unlocking d_inode() is wrong). Assuming you've got the same pointer as the first time around, you proceed to drop rwsem at the same address where you've grabbed it. IOW, you really don't want that in the tree in this form. It *might* be partially recoverable if you replace the first dget() with dget_parent() and reorder dput() and inode_unlock() in failure case, but... some of the callers of that thing are also rather dubious. Look: you have smb2_open() calling ksmbd_vfs_may_delete(), which calls that thing. Downstream of this: if (!file_present) { ... } else if (!already_permitted) { If the parent is *NOT* already locked by that point, just how much is your 'file_present' worth? And if it is, you'd obviously deadlock right there and then... I'm not sure I like what you've done with added exports - e.g. __lookup_hash had been OK as a name of static function, but exporting it is asking for clashes. And honestly, what would you say when running into a name like that? OK, it sounds like it's a (probably low-level) lookup in some hash table. _Maybe_ it would've been fine if we had one and only implementation of hash tables in the entire tree and that had been a part of it, but it's nothing of that sort. And "hash" in the name is not about doing a hash lookup as opposed to some other work (it *does* handle hash misses, allocating dentry, asking filesystem to do real on-disk lookup, etc.) - it's actually about "hash function of the name is already calculated". My fault, that - predecessor of that thing had been called lookup_one(); it took a string, calculated its length and computed hash, then proceeded to do lookups. The latter part could be reused in handling of rmdir et.al., where we already had the component length and hash precalculated, so the tail of lookup_one() had been carved out into a separate helper. Circa 2.3.99... Anyway, the name is _not_ fit for an export; I'm not sure what to call it - lookup_one_qstr(), perhaps? Additional fun is due to the fact that these days it is slightly different from the lookup_one() et.al. Those can be called with directory held shared; that allows parallel lookups, but it's not free of cost - if we run into a cache miss and need to allocate a new dentry and talk to filesystem, we have to recheck the hash table after allocation. __lookup_hash() is called only with parent held exclusive and it can skip that fun - hash miss is going to remain a miss; nobody else will be able to insert stuff into dcache in that directory until we unlock it. What I'm worried about is that renaming it to lookup_one_qstr() will be an invitation for "oh, we happen to have hash/len already known by the time of that lookup_one() call; let's just convert it to lookup_one_qsrt()" and if that happens in a place where the parent is held only shared, we'll be in trouble. OTOH, lookup_one_qstr_excl() sounds like an invitation to do something painful to whoever's responsible for such name... Suggestions, anyone?