Re: [PATCH 1/1] fiemap: move i_op->fiemap() locking to the ioctl_fiemap()

On Tue, Sep 25, 2012 at 4:56 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Sep 24, 2012 at 02:28:15PM +0300, Kasatkin, Dmitry wrote:
>> On Mon, Sep 24, 2012 at 12:18 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Mon, Sep 24, 2012 at 11:13:30AM +0300, Kasatkin, Dmitry wrote:
>> >> On Sat, Sep 22, 2012 at 1:59 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> >> > On Fri, Sep 21, 2012 at 05:14:28PM +0300, Dmitry Kasatkin wrote:
>> It is not necessarily the case that i_mutex must be held to provide
>> stable, unchanging extent maps.
>> The FS might not require that.
>> But the caller might require some sort of atomicity.
>> >
>> > fiemap is not the interface you want, then. It gives *zero*
>> > guarantee that what it returns actually reflects what is on disk.
>> > What is on disk can change the moment the filesystem drops the lock
>> > that protects the extent map, and because i_mutex doesn't protect
>> > the extent map, you aren't getting the atomicity you require....
>> >
>>
>> I use filemap_write_and_wait() before using fiemap.
>> And the result seems to stay the same across reboots on ext4.
>
> That's not the problem. Try running fallocate() to randomly
> preallocate space and punch holes in the file (without changing the
> file size) in parallel with your EVM hash calculation....
>
>> In our case i_mutex is not for the FS itself, but for EVM to ensure
>> that the hash over the extents and its value in the xattrs stay in sync.
>>
>>
>> > i.e. fiemap provides a point in time snapshot that is out of date
>> > even before the snapshot is returned to the caller regardless of the
>> >
>>
>> So what is the way at the VFS level to ensure that whatever is returned
>> by fiemap stays the same across reboots?
>
> There is no way to guarantee that.
>
> Indeed, XFS can modify the extent map of an inode after the VFS has
> disposed of the inode (e.g. during unmount or memory reclaim). The
> changes after inode cache eviction mean that you can't even
> guarantee that the extent map remains unchanged from close() to
> open() a few minutes apart....
>
> Extent maps and their manipulation belong to the filesystem. The
> VFS has no business getting involved in how the filesystem manages or
> manipulates them.
>
>> ioctl_fiemap() uses filemap_write_and_wait()...
>> And it seems to work well for me as well...
>
> filemap_write_and_wait() writes the current dirty data to disk. It
> doesn't stop other threads from dirtying the inode or the extent map
> from changing at any time.
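
For reference, as far as I can see ioctl_fiemap() in fs/ioctl.c already
does roughly this (paraphrased from memory, not the exact code):

        if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC)
                filemap_write_and_wait(inode->i_mapping);

        error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len);

So the FIEMAP_FLAG_SYNC flush happens right before the filesystem's
->fiemap() call.
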
>
>> >> EVM - Extended Verification Module: security/integrity/evm
>> >>
>> >> That is part of the integrity subsystem, which provides protection
>> >> against offline modification of inode metadata:
>> >> uid, gid, perm, ino, xattrs, etc.
>> >
>> > *groan*
>> >
>> >> EVM has several hooks, which are called when the above metadata changes.
>> >> And they are called with i_mutex held.
>> >> i_mutex is used to prevent races between the security verification
>> >> and the file metadata attribute update.
>> >
>> > Sounds great, until you consider the fact that inode metadata can be
>> > updated without i_mutex being held. e.g. fallocate vs fiemap.  Have
>> > you ever had a look at how many different ioctls various filesystems
>> > implement that modify metadata without needing to hold i_mutex?
>> >
>>
>> What inode metadata are you talking about?
>
> Extent maps. ioctl()s that manipulate extent maps. Filesystem-specific
> inode metadata and the ioctls that manipulate it. ioctls that
> don't go through VFS paths. Handle-based attribute manipulations via
> ioctls.  Invisible IO that specifically avoids changing user visible
> attributes like atime/ctime/mtime on reads and writes.
>
>> do_truncate:
>>       mutex_lock(&dentry->d_inode->i_mutex);
>>       ret = notify_change(dentry, &newattrs);
>>       mutex_unlock(&dentry->d_inode->i_mutex);
>>
>> chmod_common:
>>       mutex_lock(&inode->i_mutex);
>>       error = notify_change(path->dentry, &newattrs);
>>       mutex_unlock(&inode->i_mutex);
>>
>> chown_common:
>>       mutex_lock(&inode->i_mutex);
>>       error = security_path_chown(path, user, group);
>>       if (!error)
>>               error = notify_change(path->dentry, &newattrs);
>>       mutex_unlock(&inode->i_mutex);
>>
>> It all happens with i_mutex locked.
>> They all then call:
>>       if (inode->i_op->setattr)
>>               error = inode->i_op->setattr(dentry, attr);
>
> Sure. But i_mutex only specifically protects VFS maintained
> metadata. The extent map is not a VFS maintained structure - it's a
> filesystem internal detail.
>
>> >> One of the patches we are working on now would like to use extent map
>> >> information to prevent a certain attack.
>> >
>> > I'm not a mind reader. What attack may that be?
>>
>> That is a simple attack: modify the inode block map so that different
>> files point to the same blocks.
>> fsck calls them "multiply claimed blocks"...
>
> Just force users to run fsck before mounting?
>

Not good for a mobile device from a boot time point of view.

> Better question: why won't file data integrity checks detect such
> errors? Data integrity checks will only pass if the duplicate block
> has identical data, or someone is clever enough to find an existing
> block on disk that magically causes a hash collision. Finding a hash
> collision already on disk is so unlikely that someone who has the
> access and ability to calculate a hash collision will simply
> overwrite the file contents directly....
>

No, that is not about a collision...

A possible attack on IMA - I have done it...

1. Make a copy of a file.
    Now we have 2 files with the same data and the same hash.
2. Modify the block map so that the second file points to the blocks of the first...
    IMA verification succeeds, because the content is the same.
3. Write some data to the second file...
    It will also change the content of the first file.
4. Create some memory pressure to force the kernel to drop the cached pages.
5. Open the first file again.
    The pages will be read back from disk, but they are not re-verified by IMA,
    because the inode is still in the cache and IMA still thinks it is
    verified... BOOM...
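
For completeness, the extent map I was thinking of hashing is just what a
plain FIEMAP call returns. A minimal userspace sketch (illustration only,
not the actual EVM/IMA code; it assumes at most 32 extents, a real
implementation would loop until all extents are returned) that dumps what
step 2 above tampers with:

/* fiemap-dump.c - illustration only */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        struct fiemap *fm;
        unsigned int i, n = 32;                 /* assume at most 32 extents */
        int fd;

        if (argc < 2)
                return 1;

        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;

        fm->fm_start = 0;
        fm->fm_length = ~0ULL;                  /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush dirty data first */
        fm->fm_extent_count = n;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                return 1;

        for (i = 0; i < fm->fm_mapped_extents; i++)
                printf("logical %llu physical %llu length %llu\n",
                       (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_physical,
                       (unsigned long long)fm->fm_extents[i].fe_length);

        close(fd);
        return 0;
}

After the block map manipulation in step 2, the two files report overlapping
fe_physical ranges here, while the data hashes still match.
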


> Seriously, if someone can modify your block device directly then
> you've already lost and no amount of after-the-fact verification
> will save you.

Are you talking about offline or online modification?
Integrity protection is against offline modification...
Online modification is protected by access control...

>
>> >> Of course, as above, we do it in EVM hooks which are called under i_mutex.
>> >> And moreover we need to ensure atomicity when reading the extent map and
>> >> updating the xattrs - just the same as it works now
>> >> for other inode metadata.
>> >
>> > Fundamentally, extent maps are not protected by i_mutex, and that's
>> > where your design breaks down. What you want cannot be provided by
>> > i_mutex, and that's one of the reasons why i_mutex is not held above
>> > .fiemap - it's not appropriate locking for the operation being
>> > performed.
>>
>> Really, as I said, we are under i_mutex not to protect the FS,
>> but to make sure that the extent content and the extended attribute stay in sync.
>
> i_mutex does not give you that guarantee.
>
> BTW, if you were concerned about integrity, you'd be running fsync
> after writing the attribute to ensure the attribute goes straight to
> disk and stays in sync with the extent map on disk, yes? You're not
> running fsync, though, are you?  I'm pretty sure, because otherwise
> this conversation would be about fsync, not fiemap....
>
> Even so, the xattr modification will be a separate transaction to
> the extent map manipulation, so it's entirely possible that a crash
> between the change of the extent map and writing of the xattr can
> occur, and so you end up out of sync anyway. The only way to keep
> them in sync is to have the filesystem write the xattr in the same
> transaction that modifies the extent map.....
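
Just to make sure I follow - the userspace equivalent of the ordering you
describe would be something like this (illustration only; EVM does the xattr
write in-kernel, and hmac/hmac_len are placeholders):

        fd = open(path, O_RDWR);
        /* ... recompute the hash after the data / extent change ... */
        fsetxattr(fd, "security.evm", hmac, hmac_len, 0);
        fsync(fd);      /* gets the new xattr onto disk promptly, but as you
                           say it is still a separate transaction from the
                           extent map change */
        close(fd);
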
>
>> > Here's a beautiful example: XFS_IOC_SWAPEXT.
>> >
>> > This filesystem specific ioctl() atomically swaps the extent maps
>> > between two inodes.  And it doesn't hold i_mutex to do it, and it
>> > doesn't even inform the VFS that it has happened. Indeed, it is
>> > *completely transparent* to the VFS because it's a special interface
>> > to the filesystem used by filesystem maintenance tools, namely the
>> > online defragmenter xfs_fsr. That means running xfs_fsr will really
>> > screw with any EVM requirement that special security attributes and
>> > extent maps are kept in sync....
>> >
>>
>> That is a tough case then, if the VFS is unaware.
>> For integrity protection it might be a requirement not to use such
>> ways of modifying the file system.
>
> Ok, so you're effectively saying that what you want to do with EVM
> requires that no non-VFS based extent map manipulation will be
> allowed. So, no automatic tiering via hot-block tracking. No file
> defragmentation, automatic or manual. No automatic repair of
> detected corruption. No automatic data replication. No HSMs. No
> filesystem level data deduplication. No data restriping. No tree
> rebalancing. No filesystem specific dump/restore or send/recv.
>

Bad...

> Perhaps you might want to rethink what you are trying to achieve
> with extent map checking?
>

I have to...

>> But OK, you are aware of the problem.
>> What is your suggestion on how we could get some integrity measurement
>> at the VFS layer which
>> allows us to identify whether the inode block mapping has changed?
>
> 1. Validate the data in the file, not the metadata that the filesystem
> uses to index the data. (easy)
>
> 2. Have the filesystem modify the integrity xattr in the same
> transaction that modifies the extent map. (hard, might be impossible
> for some filesystems)
>


I thought about using fiemap to eliminate the attack I described.
I would not like an approach which somehow works for ext4 but not for all the others.


Are there any ways to detect whether any of the pages have been dropped
from the kernel page cache?

- Dmitry

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx