Re: [GIT PULL] ksmbd server fixes

Namjae Jeon <linkinjeon@xxxxxxxxxx> · Fri, 1 Apr 2022 21:52:09 +0900



2022-04-01 10:20 GMT+09:00, Al Viro <viro@xxxxxxxxxxxxxxxxxx>:
> On Tue, Mar 29, 2022 at 06:48:51PM -0700, Linus Torvalds wrote:
>> On Mon, Mar 28, 2022 at 6:39 PM Steve French <smfrench@xxxxxxxxx> wrote:
>> >
>> > - four patches relating to dentry race
>>
>> I took a look at this, and I absolutely hate it.
>>
>> You are basically exporting vfs-internal functions (__lookup_hash() in
>> particular) that really shouldn't be exported.
>>
>> And the _reason_ you are exporting them seems to be that you have
>> copied most of do_renameat2() as ksmbd_vfs_rename() but then made
>> various small changes.
>>
>> This all makes me _very_ uncomfortable.
>>
>> I really get the feeling that we'd be much better off exporting
>> something that is just a bigger part of that rename logic, not that
>> really low-level __lookup_hash() thing.
>>
>> >From what I can tell, almost everything between that __lookup_hash()
>> and the final vfs_rename() call is shared, even to the point of error
>> label names (ok, you call it "out5", the vfs layer calls it "exit5".
>>
>> The main exception is that odd
>>
>>         parent_fp = ksmbd_lookup_fd_inode(old_parent->d_inode);
>>         if (parent_fp) {
>>               ...
>>
>> sequence, but that looks like it could have been done before the
>> __lookup_hash() too.
>
> If only...  Note that
>         if (old_dentry != old_child) {
> 		err = -ESTALE;
> 		dput(old_dentry);
> 		goto out4;
> 	}
> thing in there.  -ESTALE is actually wrong - what happens there is
> a lost race with rename.
>
> The source of headache is that they are trying to implement not
> "rename this entry in this directory to that entry in that directory";
> it's "rename an opened file, no matter which directory it's in".
>
> Comes from shit-for-brains protocol, that tries both to treat file
> name as a piece of metadata (which, in itself, would be defensible)
> *and* allow cross-directory moves.  Shoved into the same fchown-like
> scheme.
>
> So they end up with source already looked up.  We want both parents
> locked and we bloody well want them to be the parents of the objects
> we'll operate upon.
>
> What that code does is
> 	* pin a reference to current parent of source
> 	* look the parent of destination of up
> 	* lock both
> 	* ... and pray that child is still the child of what we'd locked.
> 	*	for that, take a snapshot of its name (it's not guaranteed
> 		to be stable otherwise)
> 	*	look that name up in what used to be its parent
> 	*	... and see if what we've got is our source.
>
> In case of mismatch (i.e. cross-directory rename at any point between
> dget_parent() and the end of lock_rename()) we fail with -ESTALE.  Note
> that we do *not* do anything of that sort if that cross-directory rename
> happens just prior to dget_parent() - that is accepted just fine.
>
> This is bogus.  That song and dance is done after we'd locked the
> thing that used to be the parent of source.  At that point nothing
> can move into or out of that sucker, so all we need is a check that
> old_child->d_parent is equal to old_parent and, probably, that it's
> still hashed.
>
> No re-lookup is needed.  What's more, I suspect that the right thing
> to do would be a specialized variant of lock_rename() that would
> take source and new parent and do the following:
>
> 	if (READ_ONCE(source->d_parent) == new_parent) {
> 		inode_lock_nested(new_parent->d_inode, I_MUTEX_PARENT);
> 		if (likely(source->d_parent == new_parent))
> 			return NULL;
> 	}
> 	// fuck that, looks like a cross-rename one.
> 	mutex_lock(&source->d_sb->s_vfs_rename_mutex);
> 	// now all ->d_parent are stable
> 	if (unlikely(source->d_parent == new_parent)) {
> 		inode_lock_nested(new_parent->d_inode, I_MUTEX_PARENT);
> 		// we want the same rules as for lock_rename()
> 		mutex_unlock(&source->d_sb->s_vfs_rename_mutex);
> 		return NULL;
> 	}
> 	// cross-directory it is...
> 	same as lock_rename() after having grabbed ->s_vfs_rename_mutex
>
> Pass old_child and new_path.dentry to this lock_rename() analogue.
> As soon as it returns, we know that old_child->d_parent is stable,
> that it and new_path.dentry are properly locked for rename.
> No dget_parent() needed (just dget(old_child->d_parent)), no need to
> bother with re-lookups, snapshots, etc. - check that old_child is
> still hashed and if it isn't, we really can return a hard error.
> Unlike this -ESTALE, it's not a transient condition.
>
> To be paired with unlock_rename() - just remember to pass it the
> value of old_child->d_parent taken *before* vfs_rename().  And
> it would have to live in fs/namei.c, right next to lock_rename(),
> obviously.
Thanks for your detailed explanation!
I will check it.

>
> I don't think that __lookup_hash() makes a good _name_ (or calling
> conventions) to expose, but it's not really different from e.g.
> lookup_one_len() as far exposing internals is concerned.
> We want the parent held exclusive and that's all there is to it.
>
> BTW, what the hell is this about?
>         if (d_is_symlink(new_path.dentry)) {
> 		err = -EACCES;
> 		goto out2;
> 	}
> new_path is the new parent; why bother with that check at all?
> It's not as if any kind of lookup in it would have a chance to
> succeed...
Let me check it again. I will remove this check.

Thanks!
>
>> I'd be even happier if we could actually hjave some common helper
>> starting at that
>>
>>         trap = lock_rename(...);
>>
>> because that whole locking and 'trap' logic is kind of a big deal, and
>> the locking is the only thing that makes __lookup_hash() work.
>>
>> Adding Al to the cc in case he has any ideas. Maybe he's ok with this
>> whole thing, and with his blessing I'll take this pull as-is, but as
>> it is I'm not comfortable with it.
>
> Unfortunately, it's not similar to rename(2) - different things given
> to it ;-/  Bits and pieces might be possible to factorize, but the
> difference in what we start from is fundamental.
>