Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs

Jeff Mahoney <jeffm@xxxxxxxx> · Thu, 24 Mar 2016 11:31:56 -0400

On 3/24/16 11:20 AM, Al Viro wrote:
> On Thu, Nov 05, 2015 at 11:03:58PM -0500, Jeff Mahoney wrote:
> 
>> I suppose the irony here is that, AFAIK, that code is to ensure a file
>> doesn't get lost between transactions due to rename.
>>
>> Isn't the file->f_path.dentry relationship stable otherwise, though? The
>> name might change and the parent might change but the dentry that the
>> file points to won't.
> 
> Sure, it is stable.  But that doesn't guarantee anything about the
> ancestors.
> 
> btrfs_log_inode_parent() is a mess.  Look at that loop you've got there:
> 
>        while (1) {
>                 if (!parent || d_really_is_negative(parent) || sb != d_inode(parent)->i_sb)
>                         break;
> 
> Really?  NULL from dget_parent()?  A negative from dget_parent()?  Sodding
> superblock changing from transition to parent?
> 
>                 inode = d_inode(parent);
>                 if (root != BTRFS_I(inode)->root)
>                         break;
> 
> Umm...  Something like crossing a snapshot boundary?
> 
>                 if (BTRFS_I(inode)->generation > last_committed) {  
>                         ret = btrfs_log_inode(trans, root, inode,
>                                               LOG_INODE_EXISTS,
>                                               0, LLONG_MAX, ctx);
>                         if (ret)
>                                 goto end_trans;
>                 }
>                 if (IS_ROOT(parent))
>                         break;
> 
>                 parent = dget_parent(parent);
>                 dput(old_parent);
>                 old_parent = parent;
>         }
> 
> Note that there's not a damn thing to guarantee that those directories have
> anything to do with your file - race with rename(2) easily could've sent you
> chasing the wrong chain of ancestors.  Incidentally, what will happen if
> you do that fsync() to an unlinked file?  It still has ->d_parent pointing
> to the what used to be its parent (quite possibly itself already rmdir'ed and
> pinned down by that reference; inode is still busy, as if we'd done open and
> rmdir).  Is that what those checks are attempting to catch?  And what happens
> if rmdir happens between the check and btrfs_log_inode()?
> 
> The thing is, playing with ancestors of opened file is very easy to get
> wrong, regardless of overlayfs involvement.  And this code is anything
> but clear.  I don't know btrfs guts enough to be able to tell how much
> can actually break (as opposed to adding wrong inodes to transaction),
> but I would really suggest taking a good look at what's going on there...

Yeah, absolutely.  The file_dentry() patch that you and Miklos worked
out the other day fixes the fsync crashes for us when we use it in
btrfs_file_sync but you're 100% correct in your opinion of this code.

>> I did find a few other places where that assumption happens without any
>> questionable traversals.  Sure, all three are in file systems unlikely
>> to be used with overlayfs.
>>
>> ocfs2_prepare_inode_for_write uses file->f_path.dentry for
>> should_remove_suid (due to needing to do it early since cluster locking
>> is unknown in setattr, according to the commit).  Having
>> should_remove_suid operate on an inode would solve that easily.
> 
> Can't do - there are filesystems that _need_ dentry for ->setattr().
>
> 
> Grr...  Sorry, misread what you'd written.  should_remove_suid()
> ought to be switched to inode, indeed.

Ok, I'll write that up.

>> cifs_new_fileinfo keeps a reference to the dentry but it seems to be
>> used mostly to access the inode except for the nasty-looking call to
>> build_path_from_dentry in cifs_reopen_file, which I won't be touching.
>> That does look like a questionable traversal, especially with the "we
>> can't take the rename lock here" comment.
> 
> cifs uses fs-root-relative pathnames in protocol; nothing to do there.
> Where do you see that comment, BTW?  It uses read_seqbegin/read_seqretry
> on rename_lock there...

I must've been looking at old code.  I don't see it now either.

-Jeff

-- 
Jeff Mahoney

Attachment:
signature.asc

Description: OpenPGP digital signature