Possible FS race condition between vfs_rename and do_linkat (fs/namei.c)

Xavier Roche <xavier.roche@xxxxxxxxxxx> · Sat, 24 Aug 2019 09:24:40 +0200

Dear distinguished filesystem contributors,

There seems to be a race condition between vfs_rename and do_linkat
when those operations run in parallel:

1. Moving a file onto a target file (e.g. mv file target)
2. Creating a hard link from the target file (e.g. ln target link)

My understanding is that, since the target file is never erased from
the client's point of view but only replaced, the link should never fail.

But maybe this is something the filesystem cannot guarantee at all
(typically w.r.t. POSIX)?

To demonstrate the issue, run the following script. One loop moves
"file" to "target" while, in parallel, a second loop links "target"
to "link":

========== Cut here ==========
#!/bin/bash
#

rm -f link file target
touch target

# Link target -> link in loop
while ln target link && rm link; do :; done &

# Overwrite file -> target in loop
while touch file && mv file target; do :; done &

wait
========== Cut here ==========

Running the script will yield:
./bug.sh
ln: failed to create hard link 'link' => 'target': No such file or directory
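
For automated runs, a bounded variant that counts failures instead of
looping forever may be more convenient. The following is only a sketch:
the ITERATIONS variable, its default value and the temporary working
directory are assumptions of this example, not part of the original
script. Since this is a race, a given run may well report zero failures.

```bash
#!/bin/bash
# Bounded variant of the reproducer: fixed iteration count, failures counted.
# ITERATIONS and the temporary directory are assumptions of this sketch.

ITERATIONS=${ITERATIONS:-500}

workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1
touch target

# Overwrite file -> target in a background loop
(
    i=0
    while [ "$i" -lt "$ITERATIONS" ]; do
        touch file && mv file target
        i=$((i + 1))
    done
) &
mover=$!

# Link target -> link in the foreground, counting failed ln calls
# (the error message is discarded, only the exit status is used)
failures=0
i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    if ln target link 2>/dev/null; then
        rm -f link
    else
        failures=$((failures + 1))
    fi
    i=$((i + 1))
done

wait "$mover"
echo "ln failures: $failures / $ITERATIONS"

# Clean up the scratch directory
cd / && rm -rf "$workdir"
```

Raising ITERATIONS (e.g. ITERATIONS=20000 ./bug-bounded.sh, with a
hypothetical script name) makes a hit more likely on a fast machine.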

The issue seems to lie inside vfs_link (fs/namei.c):
       inode_lock(inode);
       /* Make sure we don't allow creating hardlink to an unlinked file */
       if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
               error =  -ENOENT;

A possible explanation is that the inode link count (i_nlink) is zero
because the file has just been replaced concurrently: the link
operation resolved "target" to the old inode, the rename then dropped
that inode's last link, and as such the link operation fails.
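
The link-count drop caused by the rename can be observed from userspace.
The sketch below assumes a hypothetical extra hard link named "keep",
which exists only to keep the old inode reachable so that its count can
be read before and after the move; the nlink helper and the temporary
directory are likewise assumptions of this example. Without "keep", the
same rename would take the old inode from one link to zero, which is the
state the vfs_link check rejects.

```bash
#!/bin/bash
# Observe the replaced inode's link count across a rename.
# "keep" and nlink() are hypothetical names used only for this sketch.

# Print the hard link count of a path (second field of ls -l output)
nlink() { ls -ld -- "$1" | awk '{ print $2 }'; }

workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

touch target
ln target keep            # old inode now has two names: target and keep
before=$(nlink keep)

touch file
mv file target            # rename replaces target; old inode loses one name
after=$(nlink keep)

echo "old inode link count: before=$before after=$after"

# Clean up the scratch directory
cd / && rm -rf "$workdir"
```

On a local filesystem this should print before=2 after=1.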

Patching with this very naive fix "solves" the issue (but this is
probably not something we want):

diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..befb15f4b865 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4231,9 +4231,10 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de

        inode_lock(inode);
        /* Make sure we don't allow creating hardlink to an unlinked file */
-       if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
-               error =  -ENOENT;
-       else if (max_links && inode->i_nlink >= max_links)
+       //if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
+       //      error =  -ENOENT;
+       // else
+       if (max_links && inode->i_nlink >= max_links)
                error = -EMLINK;
        else {
                error = try_break_deleg(inode, delegated_inode);

Kudos to Xavier Grand from Algolia for spotting the issue with a
reproducible case.

-- 
Xavier Roche -
xavier.roche at algolia.com