Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

Mimi Zohar <zohar@xxxxxxxxxxxxxxxxxx> · Mon, 02 Oct 2017 08:09:55 -0400

On Mon, 2017-10-02 at 15:35 +1100, Dave Chinner wrote:
> On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
> > On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> > > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@xxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Right, re-introducing the iint->mutex and a new i_generation field in
> > > > > the iint struct with a separate set of locks should work.  It will be
> > > > > reset if the file metadata changes (eg. setxattr, chown, chmod).
> > > > 
> > > > Note that the "inner lock" could possibly be omitted if the
> > > > invalidation can be just a single atomic instruction.
> > > > 
> > > > So particularly if invalidation could be just an atomic_inc() on the
> > > > generation count, there might not need to be any inner lock at all.
> > > > 
> > > > You'd have to serialize the actual measurement with the "read
> > > > generation count", but that should be as simple as just doing a
> > > > smp_rmb() between the "read generation count" and "do measurement on
> > > > file contents".
> > > 
> > > We already have a change counter on the inode, which is modified on
> > > any data or metadata write (i_version) under filesystem locks.  The
> > > i_version counter has well defined semantics - it's required by
> > > NFSv4 to increment on any metadata or data change - so we should be
> > > able to rely on its behaviour to implement IMA as well. Filesystems
> > > that support i_version are marked with [SB|MS]_I_VERSION in the
> > > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
> > > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
> > > ATM).
> > 
> > Recently I received a patch to replace i_version with mtime/atime.
> 
> mtime is not guaranteed to change on data writes - the resolution of
> the filesystem timestamps may mean mtime only changes once a second
> regardless of the number of writes performed to that file. That's
> why NFS can't use it as a change attribute, and hence we have
> i_version....
> 
> >  Now, even more recently, I received a patch that claims that
> > i_version is just a performance improvement.
> 
> Did you ask them to explain/quantify the performance improvement?

Using i_version is a performance improvement as opposed to always
calculating the file hash and writing the xattr.  The patch is
intended for filesystems that don't support i_version (eg. ubifs).
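
As a rough illustration of that fallback (a userspace sketch only, not
the patch itself): with a usable change counter the cached hash can be
reused while the counter is unchanged, and without one the file is
always treated as changed.  The struct and function names below are
hypothetical and do not exist in the IMA code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached state, loosely modelled on what the iint keeps. */
struct cached_state {
	uint64_t version;   /* change counter sampled at the last measurement */
	bool have_hash;     /* true once a hash has been calculated */
};

/*
 * Decide whether the file hash has to be recalculated.
 * fs_has_version stands in for an IS_I_VERSION(inode)-style check.
 */
static bool must_remeasure(bool fs_has_version, uint64_t current_version,
			   const struct cached_state *c)
{
	if (!c->have_hash)
		return true;            /* nothing cached yet */
	if (!fs_has_version)
		return true;            /* no counter: assume the file changed */
	return c->version != current_version;
}
```

The counter only saves work in the last case; every other path ends up
recalculating the hash and rewriting the xattr.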

> e.g. Using i_version on XFS slows down performance on small
> writes by 2-3% because all data writes log a
> version change rather than only logging a change when mtime updates.
> We take that penalty because NFS requires specific change attribute
> behaviour, otherwise we wouldn't have implemented it at all in
> XFS...
> 
> >  For file systems that
> > don't support i_version, assume that the file has changed.
> > 
> > For file systems that don't support i_version, instead of assuming
> > that the file has changed, we can at least use i_generation.
> 
> I'm not sure what you mean here - the struct inode already has an
> i_generation variable. It's a lifecycle indicator used to
> discriminate between alloc/free cycles on the same inode number.
> i.e. It only changes at inode allocation time, not whenever the data
> in the inode changes...

Sigh, my error.

> 
> > With Linus' suggested changes, I think this will work nicely.
> > 
> > > The IMA code should be able to sample that at measurement time and
> > > either fail or be retried if i_version changes during measurement.
> > > We can then simply make the IMA xattr write conditional on the
> > > i_version value being unchanged from the sample the IMA code passes
> > > into the filesystem once the filesystem holds all the locks it needs
> > > to write the xattr...
> > 
> > > I note that IMA already grabs the i_version in
> > > ima_collect_measurement(), so this shouldn't be too hard to do.
> > > Perhaps we don't need any new locks or counters at all, maybe just
> > > the ability to feed a version cookie to the set_xattr method?
> > 
> > The security.ima xattr is normally written out in
> > ima_check_last_writer(), not in ima_collect_measurement().
> 
> Which, if IIUC, does this to measure and update the xattr:
> 
> ima_check_last_writer
>   -> ima_update_xattr
>     -> ima_collect_measurement
>     -> ima_fix_xattr
> 
> >  ima_collect_measurement() calculates the file hash for storing in the
> > measurement list (IMA-measurement), verifying the hash/signature (IMA-
> > appraisal) already stored in the xattr, and auditing (IMA-audit).
> 
> Yup, and it samples the i_version before it calculates the hash and
> stores it in the iint, which then gets passed to ima_fix_xattr().
> Looks like all that is needed is to pass the i_version back to the
> filesystem through the xattr call....
> 
> IOWs, sample the i_version early while we hold the inode lock and
> check the writer count, then if it is the last writer drop the inode
> lock and call ima_update_xattr(). The sampled i_version then tells
> us if the file has changed before we write the updated xattr...
> 
> > The only time that ima_collect_measurement() writes the file xattr is
> > in "fix" mode.  Writing the xattr will need to be deferred until after
> > the iint->mutex is released.
> 
> ima_collect_measurement() doesn't write an xattr at all - it just
> reads the file data and calculates the hash.

There's another call to ima_fix_xattr() from ima_appraise_measurement(). 

> > There should be no open writers in ima_check_last_writer(), so the
> > file shouldn't be changing.
> 
> If that code is not holding the inode i_rwsem across
> ima_update_xattr(), then the writer check is racy as hell.  We're
> trying to get rid of the need for this code to hold the inode lock
> to stabilise the writer count for the entire operation, and it looks
> to me like everything is there to use the i_version to ensure the
> IMA code doesn't need to hold the inode lock across
> ima_collect_measurement() and ima_fix_xattr()...

Ok
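
For reference, a rough userspace sketch of the scheme as described
above: sample the change counter, calculate the hash without holding
the inode lock, and only store the xattr if the counter still matches
the sample.  This is not kernel code.  The fake_* names are
hypothetical, a C11 atomic stands in for i_version, and FNV-1a stands
in for the real file hash.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct fake_inode {
	atomic_ulong version;     /* plays the role of i_version */
	unsigned char data[64];   /* file contents */
	size_t len;
	uint32_t xattr_hash;      /* plays the role of security.ima */
	bool xattr_valid;
};

/* Placeholder digest (FNV-1a); IMA uses a real cryptographic hash. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Every data write bumps the change counter. */
static void fake_write(struct fake_inode *ino, const void *buf, size_t len)
{
	if (len > sizeof(ino->data))
		len = sizeof(ino->data);
	memcpy(ino->data, buf, len);
	ino->len = len;
	atomic_fetch_add(&ino->version, 1);
}

struct fake_measurement {
	unsigned long version;    /* counter value sampled before hashing */
	uint32_t hash;
};

/* Sample the counter first, then hash the contents. */
static struct fake_measurement fake_collect(struct fake_inode *ino)
{
	struct fake_measurement m;

	m.version = atomic_load(&ino->version);
	m.hash = fnv1a(ino->data, ino->len);
	return m;
}

/*
 * Store the hash only if the counter still matches the sample.  In the
 * scheme discussed in this thread the comparison would be done by the
 * filesystem once it holds its own locks; here it is a plain check.
 */
static bool fake_fix_xattr(struct fake_inode *ino,
			   const struct fake_measurement *m)
{
	if (atomic_load(&ino->version) != m->version)
		return false;     /* file changed: caller retries or gives up */
	ino->xattr_hash = m->hash;
	ino->xattr_valid = true;
	return true;
}
```

A write that lands between fake_collect() and fake_fix_xattr() makes
the store fail, so a stale hash never reaches the xattr and the caller
can measure again.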

Mimi