Directory locking used to provide the following warranties:

1. Any read operations (lookup, readdir) are done with directory locked
   at least shared.
2. Any link creation or removal is done with directory locked exclusive.
3. Any link count changes are done with the object locked exclusive.
4. Any emptiness checks (for rmdir() or overwriting rename()) are done
   with the victim locked exclusive.
5. Any rename of a non-directory is done with the object locked
   exclusive (the last part is needed by nfsd).

As far as directory contents is concerned, it very nearly amounted to
"all reads are done with directory locked shared, all modifications -
exclusive". There had been one gap in that, though - rename() can change
the parent of subdirectory and strictly speaking that does modify the
contents - ".." entry might need to be altered to match the new parent.

For almost all filesystems it posed no problem - location and
representation of ".." entry is fs-dependent, but it tends to be
unaffected by any other directory modifications. However, in some cases
it's not true - for example, a filesystem might have the contents of
small directories kept directly in the inode, switched to separate
allocation when enough entries are added. For such beasts we need an
exclusion between modifying ".." and (at least) switchover from small to
large directory format.

One solution would be an fs-private locking inside the method, another -
having cross-directory ->rename() take the normal lock on directory
being moved. Or one could make vfs_rename() itself lock that directory
instead, sparing the ->rename() instances all that headache.

That had been done in 6.5; unfortunately, locking the moved subdirectory
had been done in *all* cases, cross-directory or not. And that turns out
to be more than a bit of harmless overlocking - deadlock prevention
relies upon the fact that we never lock two directories that are not
descendents of each other without holding ->s_vfs_rename_mutex.
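To make that last invariant concrete, here is a minimal userland model of it. It is only a sketch, not kernel code: struct dir, is_ancestor() and may_hold_both() are names invented for this illustration, and the rename mutex is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy directory: only the parent link matters here; NOT struct inode. */
struct dir {
	const struct dir *parent;	/* NULL for the root */
};

/* True if a is a proper ancestor of d. */
static bool is_ancestor(const struct dir *a, const struct dir *d)
{
	for (d = d->parent; d; d = d->parent)
		if (d == a)
			return true;
	return false;
}

/*
 * The invariant deadlock prevention relies upon: two directories may be
 * held locked at the same time only if one is a descendent of the other
 * or the caller holds (the model of) ->s_vfs_rename_mutex.
 */
static bool may_hold_both(const struct dir *x, const struct dir *y,
			  bool holds_rename_mutex)
{
	assert(x && y);
	return holds_rename_mutex || is_ancestor(x, y) || is_ancestor(y, x);
}
```

In this model a parent and its child pass the check without the mutex, while two siblings pass it only with holds_rename_mutex set.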
Kudos to Mo Zou for pointing to the holes in proof of correctness -
that's what uncovered the problem...

We could revert to pre-6.5 locking scheme, but there's a less painful
solution; the cause of problem is same-directory case and in those
there's no reason for ->rename() to touch the ".." entry at all - the
parent does not change, so the modification of ".." would be
tautological.

Let's keep locking moved subdirectory in cross-directory move; that
spares ->rename() instances the need to do home-grown exclusion. They
need to be careful in one respect - if they do rely upon the exclusion
between the change of ".." and other directory modifications, they
should only touch ".." if the parent does get changed. Exclusion is
still provided by the caller for such (cross-directory) renames.

The series lives in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.rename;
individual patches in followups. It does survive local beating, but it
needs more - additional review and testing would be very welcome.

It starts with making sure that ->rename() instances are careful. Then
the locking rules for rename get changed, so that we don't lock moved
subdirectory in same-directory case. The proof of correctness gets
updated^Wfixed - the current one had several holes.

1/9..6/9) (me and Jan) don't do tautological ".." changes in instances.
	reiserfs: Avoid touching renamed directory if parent does not change
	ocfs2: Avoid touching renamed directory if parent does not change
	udf_rename(): only access the child content on cross-directory rename
	ext2: Avoid reading renamed directory if parent does not change
	ext4: don't access the source subdirectory content on same-directory rename
	f2fs: Avoid reading renamed directory if parent does not change

7/9) rename(): fix the locking of subdirectories

	We should never lock two subdirectories without having taken
->s_vfs_rename_mutex; inode pointer order or not, the "order" proposed
in 28eceeda130f "fs: Lock moved directories" is not transitive, with
the usual consequences.

	The rationale for locking renamed subdirectory in all cases was
the possibility of race between rename modifying .. in a subdirectory to
reflect the new parent and another thread modifying the same subdirectory.
For a lot of filesystems that's not a problem, but for some it can lead
to trouble (e.g. the case when short directory contents is kept in the
inode, but creating a file in it might push it across the size limit
and copy its contents into separate data block(s)).

	However, we need that only in case when the parent does change -
otherwise ->rename() doesn't need to do anything with .. entry in the
first place. Some instances are lazy and do a tautological update anyway,
but it's really not hard to avoid.

Amended locking rules for rename():
	find the parent(s) of source and target
	if source and target have the same parent
		lock the common parent
	else
		lock ->s_vfs_rename_mutex
		lock both parents, in ancestor-first order; if neither
		is an ancestor of another, lock the parent of source
		first.
	find the source and target.
	if source and target have the same parent
		if operation is an overwriting rename of a subdirectory
			lock the target subdirectory
	else
		if source is a subdirectory
			lock the source
		if target is a subdirectory
			lock the target
	lock non-directories involved, in inode pointer order if both
	source and target are such.

That way we are guaranteed that parents are locked (for obvious reasons),
that any renamed non-directory is locked (nfsd relies upon that),
that any victim is locked (emptiness check needs that, among other things)
and subdirectory that changes parent is locked (needed to protect the update
of .. entries). We are also guaranteed that any operation locking more
than one directory either takes ->s_vfs_rename_mutex or locks a parent
followed by its child.

8/9) kill lock_two_inodes()

	Folded into the sole caller and simplified - it doesn't need to
deal with the mix of directories and non-directories anymore.

9/9) rename(): avoid a deadlock in the case of parents having no common ancestor

	... and fix the directory locking documentation and proof of
correctness. Holding ->s_vfs_rename_mutex *almost* prevents ->d_parent
changes; the case where we really don't want it is splicing the root of
disconnected tree to somewhere.

	In other words, ->s_vfs_rename_mutex is sufficient to stabilize
"X is an ancestor of Y" only if X and Y are already in the same tree.
Otherwise it can go from false to true, and one can construct a deadlock
on that.

	Make lock_two_directories() report an error in such case and
update the callers of lock_rename()/lock_rename_child() to handle such
errors. The ones that could get an error, that is - e.g. debugfs_rename()
is never asked to change the parent and shouldn't be using lock_rename()
in the first place; that's a separate series, though.
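
For those who would rather read code than pseudo-code, the amended locking rules from 7/9 can be modelled in userland C roughly as below. This is a sketch under stated assumptions, not what vfs_rename() does line for line: struct node, struct lock_plan, take() and plan_rename_locks() are all invented for the illustration, the locks are merely recorded in order rather than taken, only plain rename is covered (no RENAME_EXCHANGE), and the disconnected-tree error case from 9/9 is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a filesystem object; NOT a kernel structure. */
struct node {
	struct node *parent;	/* NULL for the root */
	bool is_dir;
};

enum { MAX_LOCKS = 4 };		/* two parents + source + target */

/* Locks in acquisition order; rename_mutex models ->s_vfs_rename_mutex. */
struct lock_plan {
	bool rename_mutex;
	const struct node *order[MAX_LOCKS];
	size_t n;
};

/* True if a is a proper ancestor of d. */
static bool is_ancestor(const struct node *a, const struct node *d)
{
	for (d = d->parent; d; d = d->parent)
		if (d == a)
			return true;
	return false;
}

static void take(struct lock_plan *p, const struct node *n)
{
	assert(p->n < MAX_LOCKS);
	p->order[p->n++] = n;
}

/* target is the existing object at the destination, or NULL if none. */
static struct lock_plan plan_rename_locks(const struct node *source,
					  const struct node *new_parent,
					  const struct node *target)
{
	struct lock_plan p = { 0 };
	const struct node *old_parent = source->parent;
	const struct node *a, *b;

	if (old_parent == new_parent) {
		take(&p, old_parent);
	} else {
		p.rename_mutex = true;
		/* ancestor first; if unrelated, parent of source first */
		if (is_ancestor(new_parent, old_parent)) {
			take(&p, new_parent);
			take(&p, old_parent);
		} else {
			take(&p, old_parent);
			take(&p, new_parent);
		}
	}

	if (old_parent == new_parent) {
		/* overwriting rename of a subdirectory: lock the victim only */
		if (source->is_dir && target && target->is_dir)
			take(&p, target);
	} else {
		if (source->is_dir)
			take(&p, source);
		if (target && target->is_dir)
			take(&p, target);
	}

	/* non-directories last, in pointer order if both are such */
	a = source->is_dir ? NULL : source;
	b = (target && !target->is_dir) ? target : NULL;
	if (a && b && (uintptr_t)b < (uintptr_t)a) {
		const struct node *tmp = a;
		a = b;
		b = tmp;
	}
	if (a)
		take(&p, a);
	if (b)
		take(&p, b);
	return p;
}
```

In this model a same-directory rename of a subdirectory with no victim yields a plan of exactly one lock (the common parent), which is the change the series is after; a cross-directory one sets rename_mutex and locks both parents followed by the moved subdirectory.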