On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote: > > For anything NFS-exported we do _not_ want to unlock new inode > > before it has grown an alias; original set of fixes got the > > ordering right, but missed the nasty complication in case of > > lockdep being enabled - unlock_new_inode() does > > lockdep_annotate_inode_mutex_key(inode) > > which can only be done before anyone gets a chance to touch > > ->i_mutex. Unfortunately, flipping the order and doing > > unlock_new_inode() before d_instantiate() opens a window when > > mkdir can race with open-by-fhandle on a guessed fhandle, leading > > to multiple aliases for a directory inode and all the breakage > > that follows from that. > > > > Correct solution: a new primitive (d_instantiate_new()) > > combining these two in the right order - lockdep annotate, then > > d_instantiate(), then the rest of unlock_new_inode(). All > > combinations of d_instantiate() with unlock_new_inode() should > > be converted to that. > > Ok, so this seems to touch only the paths that create new inodes > (mkdir, mknod, etc). Is the lookup path that does: > > > unlock_new_inode() > ..... > d_splice_alias(inode, dentry); > > OK? Yes. d_splice_alias() * will do the right thing when it runs into directory inode that already has an alias * is called from ->d_lookup(), which has calling conventions allowing to return a preexisting alias The race in question is between mkdir() and open-by-fhandle that manages to guess an fhandle for directory about to be created. mkdir() side creates a new inode, inserts it into icache (locked) and proceeds towards unlock_new_inode()/d_instantiate(). Suppose it loses CPU right after unlock_new_inode() and open-by-fhandle picks the inode from icache (either having just gotten there, or finally gets woken up after having waited for the sucker to get unlocked). inode is valid, everything's set up properply, so we pass it to d_obtain_alias(), which sees that there's no exiting dentries, allocates one, rechecks, finds that there's still nothing and proceeds to attach its new anon dentry to that inode. Now mkdir regains CPU and does d_instantiate(). And we are fucked - there are *two* dentries for given directory inode. The window is narrow - to have a chance to hit it you need either to run it in a VM or have security_d_instantiate() (from d_instantiate()) to do something slow (ideally - blocking). It's non-empty, though. Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that window - open-by-fhandle won't get to the inode until after mkdir has attached a dentry to it. Then d_obtain_alias() will simply return that dentry and we are done. It's only d_instantiate() (or d_add()) that is a problem - d_splice_alias() is fine, so on the lookup path we don't need anything like that. d_add_ci() is like d_splice_alias() in that respect, so the lookup is OK in case-insensitive variant as well. So it would appear that XFS doesn't need to be touched. HOWEVER, lockdep shite *can't* be done after something has had a chance to grab the damn rwsem. I really wonder if d_instantiate(dentry, inode); xfs_finish_inode_setup(cip); doesn't lead to unpleasantness with lockdep enabled: xfs_finish_inode_setup() -> unlock_new_inode() -> lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem) which does wonders if something has already gotten to the inode via that dentry and tried e.g. lock_inode() on it.