Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 5/11/2018 6:09 AM, Al Viro wrote:
On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:

For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
     lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex.  Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

     Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode().  All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Ok, so this seems to touch only the paths that create new inodes
(mkdir, mknod, etc). Is the lookup path that does:


	unlock_new_inode()
	.....
	d_splice_alias(inode, dentry);

OK?

Yes.  d_splice_alias()
	* will do the right thing when it runs into directory inode
that already has an alias
	* is called from ->d_lookup(), which has calling conventions
allowing to return a preexisting alias

The race in question is between mkdir() and open-by-fhandle that manages
to guess an fhandle for directory about to be created.  mkdir() side
creates a new inode, inserts it into icache (locked) and proceeds towards
unlock_new_inode()/d_instantiate().  Suppose it loses CPU right after
unlock_new_inode() and open-by-fhandle picks the inode from icache
(either having just gotten there, or finally gets woken up after having
waited for the sucker to get unlocked).  inode is valid, everything's
set up properply, so we pass it to d_obtain_alias(), which sees that
there's no exiting dentries, allocates one, rechecks, finds that there's
still nothing and proceeds to attach its new anon dentry to that inode.
Now mkdir regains CPU and does d_instantiate().  And we are fucked -
there are *two* dentries for given directory inode.

The window is narrow - to have a chance to hit it you need either
to run it in a VM or have security_d_instantiate() (from d_instantiate())
to do something slow (ideally - blocking).  It's non-empty, though.

Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that
window - open-by-fhandle won't get to the inode until after mkdir has
attached a dentry to it.  Then d_obtain_alias() will simply return that
dentry and we are done.  It's only d_instantiate() (or d_add()) that is
a problem - d_splice_alias() is fine, so on the lookup path we don't
need anything like that.  d_add_ci() is like d_splice_alias() in that
respect, so the lookup is OK in case-insensitive variant as well.


Nice explanation -Could we add this description to the commit msg and also document about this API in - "Documentation/filesystems/vfs.txt" under heading "Directory Entry Cache API".
That would be helpful for others later as well.

Thanks
Ritesh

So it would appear that XFS doesn't need to be touched.  HOWEVER,
lockdep shite *can't* be done after something has had a chance to grab
the damn rwsem.  I really wonder if
		d_instantiate(dentry, inode);
         xfs_finish_inode_setup(cip);
doesn't lead to unpleasantness with lockdep enabled:
	xfs_finish_inode_setup() -> unlock_new_inode() ->
lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
which does wonders if something has already gotten to the inode
via that dentry and tried e.g. lock_inode() on it.


--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux