On Tue, Sep 06, 2016 at 02:17:08PM +0530, Ashish Sangwan wrote:
> On Wed, Aug 31, 2016 at 10:34 PM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>
> > On Aug 31 2016, Ashish Sangwan <ashishsangwan2@xxxxxxxxx> wrote:
> > > In case of a write call on any file, there is a xattr lookup call
> > > for the security.capabilities type of xattr, which is a scaling
> > > bottleneck. In some of our use cases, just enabling the xattr
> > > support, we are experiencing a performance drop of almost 20% even
> > > though the file does not have any security xattr.
> > > Fuse, by default, does not remember the presence of security
> > > attributes, as it clears the MS_NOSEC flag at the time of fill
> > > super and hence requires a lookup of the security xattr at each
> > > write. This makes sense in case of network filesystems where
> > > multiple clients can change the state of the xattr.
> > > This patch adds a new mount option, cache_security_xattr_presence,
> > > to avoid clearing the MS_NOSEC flag. This could be used by
> > > filesystem implementations which support xattrs but are local in
> > > nature, OR by implementations which have their own security
> > > policies and do not support the security.capabilities xattr.
> >
> > If I remember correctly, FUSE does not support LSMs at all, so even
> > if there is a security.capabilities xattr it won't have the expected
> > effect. So maybe it makes more sense to unconditionally catch both
> > read and write of security.capabilities in kernel and never forward
> > it to userspace?
>
> Hi Miklos, do you have any comment about the patch or Nikolaus's advice?

"security.capabilities" should work fine in fuse.  Handling of that is in
security/commoncaps.c, not part of any security module.

I'm looking at handling of ATTR_KILL_* flags in fuse and it's a mess.  It
needs to be sorted out, but it's going to be more complicated than your
patch.
Fuse handles two different models of filesystem, and unfortunately it
doesn't make a clear distinction between the two, resulting in all sorts
of "interesting" bugs.

1) Local filesystem (ntfs3g, etc).

The VFS cares for this type of fs very well, since the cached state is
always consistent with the actual filesystem state.  In this case we can
safely turn on MS_NOSEC caching.

Using the "fuseblk" filesystem type should be a reliable indication of
this mode.

Non-fuseblk filesystems can also be local.  They would be setting
attribute and entry timeouts to large values and setting FOPEN_KEEP_CACHE
in open to get maximal attribute and data caching.  But there's no single,
reliable indication of this mode, so (perhaps with the exception of
"fuseblk") we have to assume a distributed filesystem.

2) Distributed filesystem (gluster, etc).

The filesystem can be modified externally, hence the cached state can
become out of date with the actual filesystem state.

For this case we need to be careful with MS_NOSEC.  Not only that, but
fuse allowed the VFS to set mode in setattr in order to clear suid/sgid on
chown and truncate, and (since writeback_cache) on write.  The problem
with this is that it'll potentially set a stale mode.  E.g.:

  host1: chmod 4755 foo
  host2: stat foo > /dev/null
  host1: chmod 4700 foo
  host2: chown 1:1 foo
  host2: stat -c%a foo
  755

See the problem?  host2 computes the new mode from its cached 4755 and
writes back 755, undoing host1's chmod 4700.

The proper fix would be to let the filesystems do the suid/sgid clearing
on the relevant operations.  Possibly some are already doing it (if the
filesystem just forwards operations to a real underlying fs then it will
just work).

So we need a way to know if the filesystem is clearing privs
(suid/sgid/cap) on chown, truncate and write.

A) It is clearing privs.

We can set MS_NOSEC and ignore ATTR_KILL_* in fuse_setattr().
There's a special case, though:

A*) writeback_cache:

In this case WRITE requests won't be sent to the filesystem immediately on
a write(), so we still need to do getattr/getxattr in write to determine
if privs need to be cleared.  But we can use the attr timeout to limit the
rate.

B) It isn't clearing privs.

This is the default: unless the filesystem indicates otherwise, we must
assume this, even though the filesystem may actually be clearing privs.

In this case we must remove "security.capability" before doing chown,
truncate or write (this is the current state).  But we also need to
manually kill suid/sgid in a less racy way.

Solution:

 - If ATTR_KILL_* is set in fuse_setattr(), then update the attributes and
   recalculate the mode to set, to reduce the race window.  This is a
   pretty simple and harmless change.

 - Refresh the attributes if the timeout has expired and recheck
   suid/sgid.  This will result in more correct operation, but may also
   cause a performance regression in case the attribute timeout is zero or
   very small.  If it does cause a performance regression, that will at
   least make the filesystem writers consider moving to model 1 or model
   2/A.  Still, this is a risky change, because we don't want to generally
   break working setups on new kernels, and this is a very obscure corner
   case, which probably not many care about.  I'm not sure about the
   risk/benefit ratio here...

So to conclude:

 - We need some fixes to the default behavior.

 - We need a way for the filesystem to tell the kernel module if it is
   taking care of clearing privileges for write/truncate/chown
   (FUSE_HANDLE_KILLPRIV).

 - We need a way for the filesystem to tell the kernel module if it is
   allowing caching of xattrs or lack thereof (timeout in
   fuse_getxattr_out).

 - Perhaps add a "local fs" mode where we can assume proper consistency
   between cache and backing.
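To make the stale-mode race from the host1/host2 example concrete, here is
a rough userspace model in C.  It is only a sketch, not kernel code: every
struct and function name in it is made up for illustration, and a GETATTR
round trip is reduced to copying one integer.  It replays the sequence
once with the chown trusting the cached mode, and once with the attributes
refreshed before the mode is recalculated, as in the first "Solution"
item.

```c
#include <stdbool.h>

/* Hypothetical userspace model, NOT kernel code: all names are invented. */

#define MODE_SUID 04000u
#define MODE_SGID 02000u
#define MODE_XGRP 00010u

/* Authoritative state held by the (possibly remote) filesystem. */
struct backing_file {
    unsigned int mode;
};

/* What one host's kernel remembers about the file. */
struct cached_attr {
    unsigned int mode;
};

/* Stand-in for a GETATTR round trip: reload the cache from the backing file. */
static void refresh_attr(struct cached_attr *c, const struct backing_file *f)
{
    c->mode = f->mode;
}

/* Clear suid always, and sgid only when group-exec is also set. */
static unsigned int strip_privs(unsigned int mode)
{
    mode &= ~MODE_SUID;
    if (mode & MODE_XGRP)
        mode &= ~MODE_SGID;
    return mode;
}

/* Racy variant: the new mode is computed from whatever is cached. */
static void chown_with_cached_mode(struct backing_file *f, struct cached_attr *c)
{
    f->mode = strip_privs(c->mode);   /* may write back a stale mode */
    c->mode = f->mode;
}

/* Narrowed variant: refresh the attributes, then recalculate the mode. */
static void chown_with_refreshed_mode(struct backing_file *f, struct cached_attr *c)
{
    refresh_attr(c, f);
    f->mode = strip_privs(c->mode);
    c->mode = f->mode;
}

/* Replays the host1/host2 sequence and returns the final mode of foo. */
unsigned int replay(bool refresh_before_chown)
{
    struct backing_file foo = { .mode = 04755 };   /* host1: chmod 4755 foo */
    struct cached_attr host2 = { 0 };

    refresh_attr(&host2, &foo);                    /* host2: stat foo */
    foo.mode = 04700;                              /* host1: chmod 4700 foo */

    if (refresh_before_chown)                      /* host2: chown 1:1 foo */
        chown_with_refreshed_mode(&foo, &host2);
    else
        chown_with_cached_mode(&foo, &host2);

    return foo.mode;                               /* host2: stat -c%a foo */
}
```

In this model replay(false) ends with mode 0755, matching the example, and
replay(true) ends with 0700.  Note that the refresh only narrows the
window: another host could still change the mode between the refresh and
the setattr, which is why letting the filesystem clear privs itself is the
proper fix.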
Thanks,
Miklos