Re: [PATCH v2] vfs: fix link vs. rename race

"NeilBrown" <neilb@xxxxxxx> · Wed, 14 Sep 2022 09:52:37 +1000

On Tue, 13 Sep 2022, Al Viro wrote:
> On Tue, Sep 13, 2022 at 02:41:51PM +1000, NeilBrown wrote:
> > On Mon, 21 Feb 2022, Miklos Szeredi wrote:
> > > There has been a longstanding race condition between rename(2) and link(2),
> > > when those operations are done in parallel:
> > > 
> > > 1. Moving a file to an existing target file (eg. mv file target)
> > > 2. Creating a link from the target file to a third file (eg. ln target
> > >    link)
> > > 
> > > By the time vfs_link() locks the target inode, it might already be unlinked
> > > by rename.  This results in vfs_link() returning -ENOENT in order to
> > > prevent linking to already unlinked files.  This check was introduced in
> > > v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for
> > > deleted file").
> > > 
> > > This breaks apparent atomicity of rename(2), which is described in
> > > standards and the man page:
> > > 
> > >     "If newpath already exists, it will be atomically replaced, so that
> > >      there is no point at which another process attempting to access
> > >      newpath will find it missing."
> > > 
> > > The simplest fix is to exclude renames for the complete link operation.
> > 
> > Alternately, lock the "from" directory as well as the "to" directory.
> > That would mean using lock_rename() and generally copying the structure
> > of do_renameat2() into do_linkat()
> 
> Ever done cp -al?  Cross-directory renames are relatively rare; cross-directory
> links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex
> held a _lot_.

As long as the approach is correct, it might be a good starting point
for optimisation.

We only need s_vfs_rename_mutex for link() so that we can reliably
determine any parent/child relationship to get correct lock ordering to
avoid deadlocks.  So if we use trylock on the second lock and succeed
then we don't need the mutex at all.
e.g.
 - lookup the parents of both paths.
 - lock the "to" directory.
 - if the "from" directory is the same, or if a trylock of the from
   directory succeeds, then we have the locks and can perform the
   last component lookup and perform the link without racing with
   rename. 
 - if the trylock fails, we drop the lock on "to" and use lock_rename().
   We drop the s_vfs_rename_mutex immediately after lock_rename()
   so after the vfs_link() we just unlock both parent directories.

That should avoid the mutex is most cases including "cp -al"

Holding the "from" parent locked means that NFS could safely access the
parent and basename, and send the lookup to the server to reduce the
risk of a rename on one client racing with a link on another client.
NFS doesn't guarantee that would still work (ops in a compound are not
atomic) but it could still help.  And the NFS server is *prevented* from
making the lookup and link atomic.

NeilBrown