On 8/12/20 3:15 PM, Daniel Burgener wrote:
In order to avoid concurrency issues around selinuxfs resource availability
during policy load, we first create new directories out of tree for
reloaded resources, then swap them in, and finally delete the old versions.
This fix focuses on concurrency in each of the three subtrees swapped, and
not concurrency across the three trees. This means that it is still possible
that subsequent reads to eg the booleans directory and the class directory
during a policy load could see the old state for one and the new for the other.
The problem of ensuring that policy loads are fully atomic from the perspective
of userspace is larger than what is dealt with here. This commit focuses on
ensuring that the directories contents always match either the new or the old
policy state from the perspective of userspace.
In the previous implementation, on policy load /sys/fs/selinux is updated
by deleting the previous contents of
/sys/fs/selinux/{class,booleans} and then recreating them. This means
that there is a period of time when the contents of these directories do not
exist which can cause race conditions as userspace relies on them for
information about the policy. In addition, it means that error recovery in
the event of failure is challenging.
In order to demonstrate the race condition that this series fixes, you
can use the following commands:
while true; do cat /sys/fs/selinux/class/service/perms/status
/dev/null; done &
while true; do load_policy; done;
In the existing code, this will display errors fairly often as the class
lookup fails. (In normal operation from systemd, this would result in a
permission check which would be allowed or denied based on policy settings
around unknown object classes.) After applying this patch series you
should expect to no longer see such error messages.
Signed-off-by: Daniel Burgener <dburgener@xxxxxxxxxxxxxxxxxxx>
---
security/selinux/selinuxfs.c | 145 +++++++++++++++++++++++++++++------
1 file changed, 120 insertions(+), 25 deletions(-)
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index f09afdb90ddd..d3a19170210a 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
+ tmp_policycap_dir = sel_make_dir(tmp_parent, POLICYCAP_DIR_NAME, &fsi->last_ino);
+ if (IS_ERR(tmp_policycap_dir)) {
+ ret = PTR_ERR(tmp_policycap_dir);
+ goto out;
+ }
No need to re-create this one.
- return 0;
+ // booleans
+ old_dentry = fsi->bool_dir;
+ lock_rename(tmp_bool_dir, old_dentry);
+ ret = vfs_rename(tmp_parent->d_inode, tmp_bool_dir, fsi->sb->s_root->d_inode,
+ fsi->bool_dir, NULL, RENAME_EXCHANGE);
One issue with using vfs_rename() is that it will trigger all of the
permission checks associated with renaming, and previously this was
never required for selinuxfs and therefore might not be allowed in some
policies even to a process allowed to reload policy. So if you need to
do this, you may want to override creds around this call to use the init
cred (which will still require allowing it to the kernel domain but not
necessarily to the process that is performing the policy load). The
other issue is that you then have to implement a rename inode operation
and thus technically it is possible for userspace to also attempt
renames on selinuxfs to the extent allowed by policy. I see that
debugfs has a debugfs_rename() that internally uses simple_rename() but
I guess that doesn't cover the RENAME_EXCHANGE case.
+ // Since the other temporary dirs are children of tmp_parent
+ // this will handle all the cleanup in the case of a failure before
+ // the swapover
Don't use // style comments please, especially not for multi-line
comments. I think they are only used in selinux for the
script-generated license lines.
+static struct dentry *sel_make_disconnected_dir(struct super_block *sb,
+ unsigned long *ino)
+{
+ struct inode *inode = sel_make_inode(sb, S_IFDIR | S_IRUGO | S_IXUGO);
+
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+
+ inode->i_op = &sel_dir_inode_operations;
+ inode->i_fop = &simple_dir_operations;
+ inode->i_ino = ++(*ino);
+ /* directory inodes start off with i_nlink == 2 (for "." entry) */
+ inc_nlink(inode);
+ return d_obtain_alias(inode);
+}
+
Since you are always incrementing the last_ino counter and never reusing
the ones for the removed inodes, you could technically eventually end up
with one of these directories have the same inode number as one of the
inodes whose inode numbers are generated in specific ranges (i.e. for
initial_contexts, booleans, classes, and policy capabilities). Optimally
we'd just reuse the inode number for the inode we are replacing?