On Tue, 13 Sep 2022, Al Viro wrote: > On Tue, Sep 13, 2022 at 02:41:51PM +1000, NeilBrown wrote: > > On Mon, 21 Feb 2022, Miklos Szeredi wrote: > > > There has been a longstanding race condition between rename(2) and link(2), > > > when those operations are done in parallel: > > > > > > 1. Moving a file to an existing target file (eg. mv file target) > > > 2. Creating a link from the target file to a third file (eg. ln target > > > link) > > > > > > By the time vfs_link() locks the target inode, it might already be unlinked > > > by rename. This results in vfs_link() returning -ENOENT in order to > > > prevent linking to already unlinked files. This check was introduced in > > > v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for > > > deleted file"). > > > > > > This breaks apparent atomicity of rename(2), which is described in > > > standards and the man page: > > > > > > "If newpath already exists, it will be atomically replaced, so that > > > there is no point at which another process attempting to access > > > newpath will find it missing." > > > > > > The simplest fix is to exclude renames for the complete link operation. > > > > Alternately, lock the "from" directory as well as the "to" directory. > > That would mean using lock_rename() and generally copying the structure > > of do_renameat2() into do_linkat() > > Ever done cp -al? Cross-directory renames are relatively rare; cross-directory > links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex > held a _lot_. As long as the approach is correct, it might be a good starting point for optimisation. We only need s_vfs_rename_mutex for link() so that we can reliably determine any parent/child relationship to get correct lock ordering to avoid deadlocks. So if we use trylock on the second lock and succeed then we don't need the mutex at all. e.g. - lookup the parents of both paths. - lock the "to" directory. - if the "from" directory is the same, or if a trylock of the from directory succeeds, then we have the locks and can perform the last component lookup and perform the link without racing with rename. - if the trylock fails, we drop the lock on "to" and use lock_rename(). We drop the s_vfs_rename_mutex immediately after lock_rename() so after the vfs_link() we just unlock both parent directories. That should avoid the mutex is most cases including "cp -al" Holding the "from" parent locked means that NFS could safely access the parent and basename, and send the lookup to the server to reduce the risk of a rename on one client racing with a link on another client. NFS doesn't guarantee that would still work (ops in a compound are not atomic) but it could still help. And the NFS server is *prevented* from making the lookup and link atomic. NeilBrown