Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 02/24/2011 03:20 AM, Jan Kara wrote:
> On Thu 24-02-11 06:37:49, Al Viro wrote:
>> On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
>>> [resending: left Jan off the original mail by accident]
>>>
>>> We have a multi-threaded workload which is currently "losing" files in the form
>>> of unattached inodes. The workload is link, rename, unlink intensive. This is
>>> happening on an ext2 filesystem and have reproduced the issue in kernel
>>> 2.6.37.  Here's a sample strace:
>>>
>>>     open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
>>>     link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
>>>     rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
>>>     stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
>>>     link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile")        = 0
>>>     open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
>>>     rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL")                = 0
>>>     unlink("/a/tmp/tmpfile.1296184058") = 0
>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
>>>     unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
>>>
>>> The application behavior shown above repeats indefinitely with most filenames
>>> changing during each iteration except for 'tmpfile'. Looking into this issue I
>>> see that vfs_rename_other() only takes i_mutex for the new inode and the new
>>> inode's directory as well as the old directory's mutex. This works for
>>> modifying the dir entry and appears to be fine for most filesystems, but
>>> ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
>>> inside of their respective rename functions without grabbing the i_mutex. The
>>> modifications are done through calls to inode_inc_link_count(old_inode) and
>>> inode_dec_link_count(old_inode), etc.
>>>
>>> Taking the mutex for the old inode appears to resolve the issue of the
>>> lost files/unattached inodes that I am seeing with this workload. I've attached
>>> a patch below doing what I've described above. If this is an accepted solution
>>> I believe other filesystems may also be affected by this and I could provide
>>> a patch for those as well.
>>
>> I don't know...  The thing is, we mostly do that to make life easier for
>> fsck in case of crash.  Other than that, there's no reason to play with
>> link count of that sucker at all.  The question is, do we really want
>> such rename() interrupted by dirty shutdown to result in what looks like two
>> legitimate links to that inode without any indications of what had happened?
>> Note that fsck (at least on ext2) will correct link counts anyway and if
>> nothing else, we probably want some noise pointing to the inode in question...
>   Yeah, I agree that playing with the link count is not worth it. It is
> even more disputable because it would have some reasonable effect only if
> we happened to write out the moved inode after it is linked to the new
> directory and before it is unlinked from the old one. Moreover we'd need
> to writeout the new directory and not the old directory before crash
> happens. All this is highly unlikely and even if that happens, it is
> questionable whether the result is worth it. So I'll just do away with
> those games with link count...
>   The patch is attached. Josh, can you test it as well? Thanks.
> 
> 								Honza
Jan

I'm not seeing the problem with your patch as was expected since we're
not messing with i_nlink anymore. Al suggested marking the inode as
dirty where we were previously doing the old_inode dec. I believe this
is needed as well since we are updating it's ctime. I've attached a
version marking the inode dirty and it also fixes the comment making
reference to calling inode_dec_link_count().

I'm not completely clear on the historical reasons for messing with the
link count of old_inode in the first place. It was just to simulate the
linking and unlinking of the old_inode?

Thanks
Josh
>From cb347219bc21622dee06491b689c711e452005ce Mon Sep 17 00:00:00 2001
From: Josh Hunt <johunt@xxxxxxxxxx>
Date: Thu, 24 Feb 2011 11:56:43 -0800
Subject: [PATCH] ext2: Resolve i_nlink count corruption with heavy unlink and rename

old_inode's i_nlink count can become corrupt in ext2_rename() because we do not
grab it's i_mutex in vfs_rename_other(). Looking into this further it has been
determined there is no good reason for changing old_inode's link counts at all.
Instead we just mark the inode as dirty when we are done.

CC: Al Viro  <viro@xxxxxxxxxxxxxxxxxx>
CC: Jan Kara <jack@xxxxxxx>
Signed-off-by: Josh Hunt <johunt@xxxxxxxxxx>
---
 fs/ext2/namei.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 2e1d834..adb9185 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -344,7 +344,6 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 		new_de = ext2_find_entry (new_dir, &new_dentry->d_name, &new_page);
 		if (!new_de)
 			goto out_dir;
-		inode_inc_link_count(old_inode);
 		ext2_set_link(new_dir, new_de, new_page, old_inode, 1);
 		new_inode->i_ctime = CURRENT_TIME_SEC;
 		if (dir_de)
@@ -356,12 +355,9 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 			if (new_dir->i_nlink >= EXT2_LINK_MAX)
 				goto out_dir;
 		}
-		inode_inc_link_count(old_inode);
 		err = ext2_add_link(new_dentry, old_inode);
-		if (err) {
-			inode_dec_link_count(old_inode);
+		if (err)
 			goto out_dir;
-		}
 		if (dir_de)
 			inode_inc_link_count(new_dir);
 	}
@@ -369,12 +365,11 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 	/*
 	 * Like most other Unix systems, set the ctime for inodes on a
  	 * rename.
-	 * inode_dec_link_count() will mark the inode dirty.
 	 */
 	old_inode->i_ctime = CURRENT_TIME_SEC;
+	mark_inode_dirty(old_inode);

 	ext2_delete_entry (old_de, old_page);
-	inode_dec_link_count(old_inode);

 	if (dir_de) {
 		if (old_dir != new_dir)
--
1.7.0.4


[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux