Re: [PATCH] Fuse: Add mount option to cache presence of security related xattr

Miklos Szeredi <miklos@xxxxxxxxxx> · Thu, 15 Sep 2016 16:15:57 +0200

On Tue, Sep 06, 2016 at 02:17:08PM +0530, Ashish Sangwan wrote:
> On Wed, Aug 31, 2016 at 10:34 PM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
> 
> > On Aug 31 2016, Ashish Sangwan <ashishsangwan2@xxxxxxxxx> wrote:
> > > In case of a write call on any file, there is a xattr lookup call for
> > > security.capabilities type of xattr which is a scaling bottleneck.
> > > In some of our use cases, just enabling the xattr support, we are
> > > experiencing a performance drop of almost 20% even though the file does
> > > not have any security xattr.
> > > Fuse, by default, does not remember the presence of security attributes
> > > as it clears the MS_NOSEC flag at the time of fill super and hence
> > > requires a lookup of security xattr at each write. This makes sense in
> > > case of network filesystems where multiple clients can change the state
> > > of xattr.
> > > This patch adds a new mount option cache_security_xattr_presence
> > > to avoid clearing MS_NOSEC flag. This could be used by the filesystem
> > > implementations which support xattr but are local in nature OR the
> > > implementations which have their own security policies and
> > > do not support security.capabilities xattr.
> >
> >
> > If I remember correctly, FUSE does not support LSMs at all, so even if
> > there is a security.capabilities xattr it won't have the expected
> > effect. So maybe it makes more sense to unconditionally catch both read
> > and write of security.capabilities in kernel and never forward it to
> > userspace?
> >
> 
> Hi Miklos, do you have any comment about the patch or Nikolaus's advice?

"security.capabilities" should work fine in fuse.  Handling of that is in
security/commoncaps.c, not part of any security module.

I'm looking at handling of ATTR_KILL_* flags in fuse and it's a mess.  It needs
to be sorted out, but it's going to be more complicated than your patch.

Fuse handles different models for filesystems, and unfortunately it doesn't make
a clear distinction between the two, resulting in all sorts of "interesting"
bugs.

1) Local filesystem (ntfs3g, etc).  The VFS cares for this type of fs very well,
since the cached state is always consistent with the actual filesystem state.
In this case we can safely turn on MS_NOSEC caching.  Using the "fuseblk"
filesystem type should be a reliable indication of this mode.  Non-fuseblk
filesystems can also be local.  They would be setting attribute and entry
timeouts to large values and setting FOPEN_KEEP_CACHE in open to get maximal
attribute and data caching.  But there's no single, reliable indication of this
mode, so (perhaps with the exception of "fuseblk") we have to assume a
distributed filesystem.

2) Distributed filesystem (gluster, etc).  The filesystem can be modified
externally, hence the cached state can become out of date with the actual
filesystem state.  For this case we need to be careful with MS_NOSEC.  Not only
that, but fuse allowed VFS to set mode in setattr in order to clear suid/sgid on
chown and truncate, and (since writeback_cache) write.  The problem with this is
that it'll potentially set a stale mode.  E.g.:

host1: chmod 4755 foo
host2: stat foo > /dev/null
host1: chmod 4700 foo
host2: chown 1:1 foo
host2: stat -c%a foo
755

See the problem?
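
To make the sequence concrete, here is a small userspace model of it.  This is
only a sketch: struct sim_file and the helper names are made up for
illustration and are not fuse code.  It assumes host2's kernel derives the new
mode from its cached copy when clearing suid/sgid on chown, and that sgid only
counts as a privilege bit when group execute is also set.

```c
#include <assert.h>

#define SIM_ISUID 04000u
#define SIM_ISGID 02000u
#define SIM_IXGRP 00010u

/* Hypothetical model: the real mode on the server and host2's cached copy. */
struct sim_file {
    unsigned int server_mode;  /* authoritative mode on the backing store */
    unsigned int cached_mode;  /* what host2's kernel last saw */
};

/* host1: chmod reaches the server; host2's cache is not told about it. */
static void remote_chmod(struct sim_file *f, unsigned int mode)
{
    f->server_mode = mode;
}

/* host2: stat refreshes the cached mode from the server. */
static unsigned int local_stat(struct sim_file *f)
{
    f->cached_mode = f->server_mode;
    return f->cached_mode;
}

/*
 * host2: chown where the kernel clears suid/sgid itself.  The new mode is
 * computed from the cached mode and then sent to the server in setattr.
 */
static void local_chown_kernel_kill(struct sim_file *f)
{
    unsigned int mode = f->cached_mode & ~SIM_ISUID;

    if (f->cached_mode & SIM_IXGRP)
        mode &= ~SIM_ISGID;

    f->server_mode = mode;  /* stale permission bits overwrite the server */
    f->cached_mode = mode;
}

/* Replays the five steps of the example and returns the final mode. */
unsigned int replay_example(void)
{
    struct sim_file foo = { 0, 0 };

    remote_chmod(&foo, 04755);        /* host1: chmod 4755 */
    local_stat(&foo);                 /* host2: stat, caches 4755 */
    assert(foo.cached_mode == 04755);
    remote_chmod(&foo, 04700);        /* host1: chmod 4700 */
    local_chown_kernel_kill(&foo);    /* host2: chown 1:1 */
    return local_stat(&foo);          /* host2: final stat */
}
```

In this model replay_example() ends with mode 0755, although the last chmod
asked for 4700 and a correct suid clear would have left 0700.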

The proper fix would be to let the filesystems do the suid/sgid clearing on the
relevant operations.  Possibly some are already doing it (if the filesystem just
forwards operations to a real underlying fs then it will just work).

So we need a way to know if the filesystem is clearing privs (suid/sgid/cap) on
chown, truncate and write.

A) It is clearing privs. We can set MS_NOSEC and ignore ATTR_KILL_* in
fuse_setattr().  There's a special case, though:

A*) writeback_cache: In this case WRITE requests won't be sent to the
filesystem immediately on a write() so we still need to do getattr/getxattr in
write to determine if privs need to be cleared.  But we can use the attr timeout
to limit the rate.

B) It isn't clearing privs.  This is the default: unless the filesystem
indicates otherwise, we must assume this even though the filesystem may actually
be clearing privs.  In this case we must remove "security.capabilities" before
doing chown, truncate or write (this is the current state).  But we also need to
manually kill suid/sgid in a less racy way.  Solution:

- If ATTR_KILL_* is set in fuse_setattr(), then update the attributes and
  recalculate the mode to set, to reduce the race window.  This is a pretty
  simple and harmless change.

- Refresh the attributes if timeout has expired and recheck suid/sgid.  This
  will result in more correct operation, but also may cause performance
  regression in case the attribute timeout is zero or very small.  If it does
  cause a performance regression, that will at least make the filesystem writers
  consider moving to model 1 or model 2/A.  Still, this is a risky change,
  because we don't want to generally break working setups on new kernels, and
  this is a very obscure corner case, which probably not many care about.  I'm
  not sure about the risk/benefit ratio here...
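
A rough sketch of the first bullet, in plain userspace C so it can be tried
outside the kernel.  Everything here is hypothetical: the SK_* flags stand in
for ATTR_KILL_SUID/ATTR_KILL_SGID, and fetch_mode_fn stands in for a GETATTR
round trip to the filesystem.  The point is only that the mode sent in setattr
is computed from freshly fetched attributes instead of the cached ones.

```c
#include <assert.h>

#define SK_KILL_SUID 0x1u   /* stand-in for ATTR_KILL_SUID */
#define SK_KILL_SGID 0x2u   /* stand-in for ATTR_KILL_SGID */
#define SK_ISUID 04000u
#define SK_ISGID 02000u
#define SK_IXGRP 00010u

/* Hypothetical callback standing in for a GETATTR request. */
typedef unsigned int (*fetch_mode_fn)(void *ctx);

struct kill_result {
    int send_mode;      /* nonzero if setattr should carry a mode */
    unsigned int mode;  /* mode to send, valid only if send_mode is set */
};

/* Trivial fetcher for experiments: ctx points at the "server" mode. */
unsigned int fetch_from_value(void *ctx)
{
    return *(unsigned int *)ctx;
}

/*
 * If a kill flag is set, refresh the mode first and recalculate what to
 * send.  This narrows the race window; it does not close it, because the
 * server can still change between the fetch and the setattr.
 */
struct kill_result recalc_mode_for_kill(unsigned int kill_flags,
                                        fetch_mode_fn fetch, void *ctx)
{
    struct kill_result r = { 0, 0 };
    unsigned int fresh, mode;

    assert(fetch);
    if (!(kill_flags & (SK_KILL_SUID | SK_KILL_SGID)))
        return r;                   /* nothing to kill, no extra round trip */

    fresh = fetch(ctx);
    mode = fresh;
    if (kill_flags & SK_KILL_SUID)
        mode &= ~SK_ISUID;
    /* Assumption: sgid is only a privilege bit when group exec is set. */
    if ((kill_flags & SK_KILL_SGID) && (fresh & SK_IXGRP))
        mode &= ~SK_ISGID;

    if (mode != fresh) {
        r.send_mode = 1;
        r.mode = mode;
    }
    return r;
}
```

With the server at 4700 and SK_KILL_SUID set, this asks for mode 0700; with
the server at 0644 it sends no mode at all.  The second bullet would put a
similar refresh in front of the suid/sgid check once the attribute timeout has
expired.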

So to conclude:

 - We need some fixes to the default behavior.

 - We need a way for filesystem to tell the kernel module if it is taking care
   of clearing privileges for write/truncate/chown (FUSE_HANDLE_KILLPRIV).

 - We need a way for filesystem to tell the kernel module if it is allowing
   caching of xattrs or lack thereof (timeout in fuse_getxattr_out).

 - Perhaps add a "local fs" mode where we can assume proper consistency between
   cache and backing.
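
As a way of checking that cases A, A* and B fit together, here is one possible
reading of them as a decision function.  The struct and field names are
invented for this sketch and none of this is an existing kernel interface;
handles_killpriv models what a filesystem would announce through something
like FUSE_HANDLE_KILLPRIV.

```c
#include <assert.h>
#include <stdbool.h>

/* What the filesystem told the kernel (hypothetical fields). */
struct fs_caps {
    bool handles_killpriv;  /* fs clears suid/sgid/caps itself */
    bool writeback_cache;   /* WRITE requests may be delayed */
};

/* What the kernel side would do as a result. */
struct kill_policy {
    bool set_nosec;               /* MS_NOSEC may be set */
    bool ignore_attr_kill;        /* drop ATTR_KILL_* in setattr */
    bool recheck_privs_on_write;  /* getattr/getxattr in write, rate
                                     limited by the attr timeout */
    bool kernel_kills_privs;      /* kernel removes caps and clears
                                     suid/sgid itself */
};

struct kill_policy choose_kill_policy(struct fs_caps caps)
{
    struct kill_policy p = { false, false, false, false };

    if (!caps.handles_killpriv) {
        /* Case B: the default, nothing can be assumed about the fs. */
        p.kernel_kills_privs = true;
    } else {
        /* Case A: the filesystem clears privs on chown/truncate/write. */
        p.set_nosec = true;
        p.ignore_attr_kill = true;
        /* Case A*: the fs does not see write() immediately. */
        if (caps.writeback_cache)
            p.recheck_privs_on_write = true;
    }

    /* The kernel never both kills privs itself and ignores the flags. */
    assert(!(p.kernel_kills_privs && p.ignore_attr_kill));
    return p;
}
```

A filesystem that sets neither field lands in case B, which matches the
current behavior; the xattr caching timeout and the "local fs" mode from the
list above are left out of this sketch.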

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html