Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 11 May 2018 01:39:01 +0100

On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:

> > For anything NFS-exported we do _not_ want to unlock new inode
> > before it has grown an alias; original set of fixes got the
> > ordering right, but missed the nasty complication in case of
> > lockdep being enabled - unlock_new_inode() does
> >     lockdep_annotate_inode_mutex_key(inode)
> > which can only be done before anyone gets a chance to touch
> > ->i_mutex.  Unfortunately, flipping the order and doing
> > unlock_new_inode() before d_instantiate() opens a window when
> > mkdir can race with open-by-fhandle on a guessed fhandle, leading
> > to multiple aliases for a directory inode and all the breakage
> > that follows from that.
> > 
> >     Correct solution: a new primitive (d_instantiate_new())
> > combining these two in the right order - lockdep annotate, then
> > d_instantiate(), then the rest of unlock_new_inode().  All
> > combinations of d_instantiate() with unlock_new_inode() should
> > be converted to that.
> 
> Ok, so this seems to touch only the paths that create new inodes
> (mkdir, mknod, etc). Is the lookup path that does:
> 
> 
> 	unlock_new_inode()
> 	.....
> 	d_splice_alias(inode, dentry);
> 
> OK?

Yes.  d_splice_alias()
	* will do the right thing when it runs into directory inode
that already has an alias
	* is called from ->d_lookup(), which has calling conventions
allowing to return a preexisting alias

The race in question is between mkdir() and open-by-fhandle that manages
to guess an fhandle for directory about to be created.  mkdir() side
creates a new inode, inserts it into icache (locked) and proceeds towards
unlock_new_inode()/d_instantiate().  Suppose it loses CPU right after
unlock_new_inode() and open-by-fhandle picks the inode from icache
(either having just gotten there, or finally gets woken up after having
waited for the sucker to get unlocked).  inode is valid, everything's
set up properply, so we pass it to d_obtain_alias(), which sees that
there's no exiting dentries, allocates one, rechecks, finds that there's
still nothing and proceeds to attach its new anon dentry to that inode.
Now mkdir regains CPU and does d_instantiate().  And we are fucked -
there are *two* dentries for given directory inode.

The window is narrow - to have a chance to hit it you need either
to run it in a VM or have security_d_instantiate() (from d_instantiate())
to do something slow (ideally - blocking).  It's non-empty, though.

Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that
window - open-by-fhandle won't get to the inode until after mkdir has
attached a dentry to it.  Then d_obtain_alias() will simply return that
dentry and we are done.  It's only d_instantiate() (or d_add()) that is
a problem - d_splice_alias() is fine, so on the lookup path we don't
need anything like that.  d_add_ci() is like d_splice_alias() in that
respect, so the lookup is OK in case-insensitive variant as well.

So it would appear that XFS doesn't need to be touched.  HOWEVER,
lockdep shite *can't* be done after something has had a chance to grab
the damn rwsem.  I really wonder if
		d_instantiate(dentry, inode);
        xfs_finish_inode_setup(cip);
doesn't lead to unpleasantness with lockdep enabled:
	xfs_finish_inode_setup() -> unlock_new_inode() ->
lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
which does wonders if something has already gotten to the inode
via that dentry and tried e.g. lock_inode() on it.