Re: [PATCH 1/1] fiemap: move i_op->fiemap() locking to the ioctl_fiemap()

"Kasatkin, Dmitry" <dmitry.kasatkin@xxxxxxxxx> · Mon, 24 Sep 2012 14:28:15 +0300

On Mon, Sep 24, 2012 at 12:18 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Sep 24, 2012 at 11:13:30AM +0300, Kasatkin, Dmitry wrote:
>> On Sat, Sep 22, 2012 at 1:59 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Fri, Sep 21, 2012 at 05:14:28PM +0300, Dmitry Kasatkin wrote:
>> >> On some filesystems calling i_op->fiemap() takes the i_mutex,
>> >> on others it doesn't.
>> >
>> > Exactly by design. Many fiesystems don't require the i_mutex for
>> > fiemap to be safe. e.g. the extent map is not protected by the
>> > i_mutex in XFS or ext4, and so holding the i_mutex over fiemap is
>> > incorrect.
>> >
>>
>> Please read an explanation bellow, where I tell what is EVM.
>>
>> But in my opinion it is a a bit wrong design if caller does not know
>> if any lock will or will not be locked.
>
> We've got plenty of interfaces where this is true. Look at .fsync,
> for example:
>
> $ gl -n 1 02c24a82
> commit 02c24a82187d5a628c68edfe71ae60dc135cd178
> Author: Josef Bacik <josef@xxxxxxxxxx>
> Date:   Sat Jul 16 20:44:56 2011 -0400
>
>     fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
>
>     Btrfs needs to be able to control how filemap_write_and_wait_range() is called
>     in fsync to make it less of a painful operation, so push down taking i_mutex and
>     the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
>     file systems can drop taking the i_mutex altogether it seems, like ext3 and
>     ocfs2.  For correctness sake I just pushed everything down in all cases to make
>     sure that we keep the current behavior the same for everybody, and then each
>     individual fs maintainer can make up their mind about what to do from there.
>     Thanks,
>
>     Acked-by: Jan Kara <jack@xxxxxxx>
>     Signed-off-by: Josef Bacik <josef@xxxxxxxxxx>
>     Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
>
> i.e. i_mutex used to be wrapped around the outside of .fsync, but
> that proved to be a problem for btrfs and other filesystems that *do
> not need i_mutex* during fsync. Hence the locking got pushed down
> into the filesystems, and those that did not need i_mutex got rid of
> it. Now the only rule for calling .fsync is that you can't hold
> i_mutex when calling it.

I see.

>
>> Linux kernel has many non-locking functions and its locking wrappers.
>> Many callees requires and state that caller must lock something.
>
> Yes, they do, just like many callees require and state that the
> caller must *not* lock something. Say, like might_sleep()....
>

Fair enough.

> The requirement for calling .fiemap/.fync/.fallocate/etc is that the
> "caller must not hold i_mutex". Essentially, all new inode/file
> methods we have added in the past couple of years make the
> assumption that serialisation is the problem of the filesystem, not
> the VFS. This has been done because there is not a
> "one-size-fits-all" method for fine-grained locking in filesystems.
>
>> > As a result, extent maps can change even when someone is holding the
>> > i_mutex. Any design that requires i_mutex to provide stable,
>> > unchanging extent maps is fundamentally broken....
>> >
>>
>> It is not necessarily that i_mutex to be locked to provide stable
>> unchanging extent maps.
>> FS might not require that.
>> But the caller might require some sort of atomicity.
>
> fiemap is not the interface you want, then. it gives *zero*
> guarantee that what it returns actually reflects what is on disk.
> What is on disk can change the moment the filesystem drops the lock
> that protects the extent map, and because i_mutex doesn't protect
> the extent map, you aren't getting the atomicity you require....
>

I use filemap_write_and_wait() before using fiemap.
And it seems to be persistent between reboots on ext4.

In our case i_mutex is not for FS itself, but for EVM to ensure
that hash over extents and its value in xattrs are stay in sync.

> i.e. fiemap provides a point in time snapshot that is out of date
> even before the snapshot is returned to the caller regardless of the
> locks held by the caller.
>

So what is the way at VFS to ensure that what ever is returned in fiemap
stays the same over reboots?

ioctl_fiemap() uses filemap_write_and_wait()...
And it seems to work well for me as well...

>> > So that's a NACK from me...
>> >
>> >> This change is needed in preparation for EVM to include a hash of
>> >> the extent data to be used in the HMAC calculation. EVM is called
>> >> when the i_mutex has already been taken.
>> >
>> > What's EVM? What's HMAC? and what is it actually checksumming? The
>> > extent map, or the data in the file? And if it is the extent map,
>> > what's the purpose of using the extent map for this? You need to
>> > explain how this is intended to be used, not use TLAs that people
>> > unfamiliar with your application will not understand...
>> >
>>
>> EVM - Extended Verification Module: security/integrity/evm
>>
>> That is a part of integrity subsystem which provides protection
>> against offline modification of
>> a inode metadata: uid, gid, perm, ino, xattrs, etc..
>
> *groan*
>
>> EVM has several hooks, which are called when those above metadata changes.
>> And they are called when i_mutex is locked.
>> i_mutex is used to prevent racing between some security verification
>> and file metadata attribute update.
>
> Sounds great, until you consider the fact that inode metadata can be
> updated without i_mutex being held. e.g. fallocate vs fiemap.  Have
> you ever had a look at how many different ioctls various filesystems
> implement that modify metadata without needing to hold i_mutex?
>

What inode metadata are you talking about?

do_truncate:
	mutex_lock(&dentry->d_inode->i_mutex);
	ret = notify_change(dentry, &newattrs);
	mutex_unlock(&dentry->d_inode->i_mutex);

chmod_common:
	error = notify_change(path->dentry, &newattrs);
	mutex_unlock(&inode->i_mutex);

chown_common:
	mutex_lock(&inode->i_mutex);
	error = security_path_chown(path, user, group);
	if (!error)
		error = notify_change(path->dentry, &newattrs);
	mutex_unlock(&inode->i_mutex);

It all happens with i_mutex locked.
They all call then
	if (inode->i_op->setattr)
		error = inode->i_op->setattr(dentry, attr);

>> One of the patches we are working on now would like to use extent map
>> information to prevent certain attack.
>
> I'm not a mind reader. What attack may that be?

That is a simple attack of modifying inode block map so that different
files points to the same blocks.
FSCK calls them "multiply claimed blocks"...

>
>> Of course as above we do it in EVM hooks which are called under i_mutex.
>> And more over we need to ensure atomicity when reading extent map and
>> updating xattrs - just the same as it works now
>> for other inode metadata.
>
> Fundamentally, extent maps are not protected by i_mutex, and that's
> where your design breaks down. What you want cannot be provided by
> i_mutex, and that's one of the reasons why i_mutex is not held above
> .fiemap - it's not appropriate locking for the operation being
> performed.

Really, as I said, we are under i_mutex not to protect FS,
but to make sure that extent content and extended attribute stays in sync.

>
> Here's a beautiful example: XFS_IOC_SWAPEXT.
>
> This filesystem specific ioctl() atomically swaps the extent maps
> between two inodes.  And it doesn't hold i_mutex to do it, and it
> doesn't even inform the VFS that it has happened. Indeed, it is
> *completely transparent* to the VFS because it's a special interface
> to the filesystem used by filesystem maintenance tools, namely the
> online defragmenter xfs_fsr. That means running xfs_fsr will really
> screw with any EVM requirement that special security attributes and
> extent maps are kept in sync....
>

That is a tough case then if VFS is unaware.
For integrity protection it might be a requirement not to use such
ways to modify file system.

But ok.. You are aware of the problem.
What is your suggestion on how we could get some integrity measurement
at VFS layer which
allows to identify if inode block mapping has changed?

Thanks,

Dmitry

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html