On Tue, May 17, 2022 at 09:21:14PM +0000, Al Viro wrote: > On Wed, Apr 27, 2022 at 11:32:45AM +0900, Namjae Jeon wrote: > > Al pointed out that ksmbd has racy issue from using ->d_parent and ->d_name > > in ksmbd_vfs_unlink and smb2_vfs_rename(). and use new lock_rename_child() > > to lock stable parent while underlying rename racy. > > Introduce vfs_path_parent_lookup helper to avoid out of share access and > > export vfs functions like the following ones to use > > vfs_path_parent_lookup(). > > - export __lookup_hash(). > > - export getname_kernel() and putname(). > > > > vfs_path_parent_lookup() is used for parent lookup of destination file > > using absolute pathname given from FILE_RENAME_INFORMATION request. > > First of all, this is seriously broken: > > > -int ksmbd_vfs_lock_parent(struct user_namespace *user_ns, struct dentry *parent, > > - struct dentry *child) > > +struct dentry *ksmbd_vfs_lock_parent(struct dentry *child) > > { > > - struct dentry *dentry; > > - int ret = 0; > > + struct dentry *parent; > > > > + parent = dget(child->d_parent); > > inode_lock_nested(d_inode(parent), I_MUTEX_PARENT); > > Do that in parallel with host rename() and you are risking this: > > you: fetch child->d_parent > get preempted away > another thread: move child elsewhere > another thread: drop (the last) reference to old parent > memory pressure evicts that dentry and reuses memory > you: regain the timeslice and bump what you think is parent->d_count. > In reality, it's 4 bytes in completely different data structure. > At that point you already have a memory corruptor. Worse, > there's a decent chance that subsequent code will revert the > corruption, so it would be hell to debug - you need a race > to reproduce the thing in the first place *and* you need > something else to notice the temporary memory corruption. > > you: fetch what you think is ->d_inode of that dentry. It actually > isn't anything of that sort. > you: grab rwsem at hell knows what address (might or might not point > to an rwsem). Here's another chance to get something > reproducible - e.g. if what you thought was ->d_inode actually > points to unmapped memory, you'll get an oops here. Won't > be consistent, though. > > > + if (child->d_parent != parent) { > you: ->d_parent doesn't point there anymore > > > + dput(parent); > > you: decrement those 4 bytes in whatever object it is; if you are > lucky, it won't hit zero and nobody had noticed the temporary > increment. If you are not, well... > > > + inode_unlock(d_inode(parent)); > > you: fetch ->d_inode of parent (mind you, it's a bug in its own right - > even if parent hadn't gotten freed before your dget(), after dput() > above it's fair game for getting freed; placing that dput() > before unlocking d_inode() is wrong). Assuming you've got > the same pointer as the first time around, you proceed to > drop rwsem at the same address where you've grabbed it. > > IOW, you really don't want that in the tree in this form. > > It *might* be partially recoverable if you replace the first dget() with > dget_parent() and reorder dput() and inode_unlock() in failure case, but... > some of the callers of that thing are also rather dubious. > > Look: you have smb2_open() calling ksmbd_vfs_may_delete(), which calls > that thing. Downstream of this: > if (!file_present) { > ... > } else if (!already_permitted) { > > If the parent is *NOT* already locked by that point, just how much is > your 'file_present' worth? And if it is, you'd obviously deadlock > right there and then... > > I'm not sure I like what you've done with added exports - e.g. > __lookup_hash had been OK as a name of static function, but exporting > it is asking for clashes. And honestly, what would you say when running > into a name like that? OK, it sounds like it's a (probably low-level) > lookup in some hash table. _Maybe_ it would've been fine if we had one > and only implementation of hash tables in the entire tree and that > had been a part of it, but it's nothing of that sort. And "hash" in > the name is not about doing a hash lookup as opposed to some other > work (it *does* handle hash misses, allocating dentry, asking filesystem > to do real on-disk lookup, etc.) - it's actually about "hash function > of the name is already calculated". My fault, that - predecessor of > that thing had been called lookup_one(); it took a string, calculated > its length and computed hash, then proceeded to do lookups. The latter > part could be reused in handling of rmdir et.al., where we already had > the component length and hash precalculated, so the tail of lookup_one() > had been carved out into a separate helper. Circa 2.3.99... > > Anyway, the name is _not_ fit for an export; I'm not sure what to call > it - lookup_one_qstr(), perhaps? Additional fun is due to the fact > that these days it is slightly different from the lookup_one() et.al. > Those can be called with directory held shared; that allows parallel > lookups, but it's not free of cost - if we run into a cache miss and need > to allocate a new dentry and talk to filesystem, we have to recheck the > hash table after allocation. __lookup_hash() is called only with parent > held exclusive and it can skip that fun - hash miss is going to remain > a miss; nobody else will be able to insert stuff into dcache in that > directory until we unlock it. > > What I'm worried about is that renaming it to lookup_one_qstr() will > be an invitation for "oh, we happen to have hash/len already known by the > time of that lookup_one() call; let's just convert it to lookup_one_qsrt()" > and if that happens in a place where the parent is held only shared, we'll > be in trouble. OTOH, lookup_one_qstr_excl() sounds like an invitation to > do something painful to whoever's responsible for such name... > > Suggestions, anyone? Plus, __lookup_hash() doesn't do permission checking so naming it lookup_one_qstr() seems problematic from that perspective as well... Don't kill me but: __excl_qstr_lookup_noperm() ?