Re: inode security state and cluster file systems

> > The issue has to do with the semantics of multi-node xattr updates.
> > The desirable behavior is simple: a change in an inode security label
> > (stored as an xattr) made on nodeA should be visible on all other
> > nodes on the next access. As far as I can tell, the current SELinux
> > code would initialize the inode security state on the first access
> > (e.g. via security_d_instantiate/inode_doinit_with_dentry), and from
> > that point on the cached security state is considered valid, until the
> > inode is destroyed or reused. Any subsequent inode_doinit_with_dentry
> > call would be a no-op, since isec->initialized is true. There's no way
> > to clear 'initialized', as far as I can see. This works when all
> > changes to the inode are local, and a local setxattr call would update
> > the inode security state. However, if the security label has been
> > changed on another node, some mechanism is needed to update the cached
> > security state. One could achieve this by using
> > security_inode_notifysecctx if the value of the security context is
> > known. However, in the general case retrieving the context value would
> > require some knowledge about the implementation details of LSM (like
> > the name of the security label xattr), and such knowledge is currently
> > kept within LSM code, and arguably should remain so. In other words,
> > one would have to resort to hacking.
>
> Isn't this what inode_getsecctx() is for?  So that on the node where the
> setxattr() occurs, you can fetch the security context (without needing
> to know the attribute name or whether it is even implemented via zero,
> one, or many attributes), and then ship that context over the wire using
> whatever protocol you like to the other nodes.  Then on the other nodes,
> you can invoke inode_notifysecctx() as you said to update the context.
> I think that is how it works for the labeled NFS support (not yet in
> mainline).  Admittedly that is a simpler client/server model and not a
> distributed cluster model.

In principle, it's possible to use the "push mode": the node that changes the xattr notifies the others, and since it has all of the information about the xattr at hand, the receiving side can use inode_notifysecctx().  The "push mode" has inherent problems, though.  The sender could die before notifying all nodes with a legitimate interest in seeing the update, which makes it tricky to commit the xattr change to disk and notify all interested nodes without risking an inconsistency.  Avoiding that would require a synchronous RPC multicast, possibly to a large number of nodes, which is bad for performance and scalability.  This makes sense for a client-server model; for a cluster file system, not necessarily.

Another approach to managing metadata consistency is the "pull mode": the node doing the update acquires a distributed lock of the appropriate strength, which forces other nodes to relinquish any conflicting locks they may hold and to invalidate the metadata protected by those locks.  The writer then makes a (logged) metadata update.  When another node needs to access the metadata object again, it acquires the appropriate distributed lock (forcing the writer to relinquish or downgrade its lock and flush the corresponding dirty objects to disk) and reads the object from disk.  This is the mode GPFS uses for metadata updates, e.g. inode attribute (mtime, uid, etc.) changes.

On Linux, struct inode is transparent, so the file system code can mark its contents invalid when relinquishing the lock that protects it (and mark any corresponding dentry as needing revalidation), then repopulate it with up-to-date content when the next stat or d_revalidate arrives.  At present, one can't do the same with the inode security state.


> > To remedy this situation, a new API is proposed, courtesy of Eric
> > Paris:
> >
> > void security_inode_refresh_security(struct dentry *dentry);
> >
> > The semantics would be similar to what SELinux inode_doinit provides:
> > for the SECURITY_FS_USE_XATTR case, inode security state would be set
> > based on the value of the security label fetched via getxattr -- even
> > if the security state is already initialized. For other labeling
> > behaviors, the call could be a no-op if security is already
> > initialized, and an equivalent of inode_doinit otherwise.
> >
> > Does this API look useful, in particular to other cluster file
> > systems?
>
> How do you know when to call this interface?  And if you know to call
> it, why don't you know what the new context is already?

The logical place to make this call would be at d_revalidate time.  The new value of the security context is not readily available at that point, as described above.  Technically, the file system code could learn the new context: it could do a getxattr, which would return the up-to-date label.  However, that requires knowing the security label xattr name, which is not readily available in RHEL6.  (I actually had this working with the correct semantics on RHEL5, which has security_inode_xattr_getsuffix(), but it would be fair to say that was hacking around the API rather than using it as intended.)

Hope it makes things clearer.

yuri

