On 22.04.24 at 10:45, Stas Sergeev wrote:
This flag performs the open operation with the fsuid/fsgid that
were in effect when dir_fd was opened.
This allows the process to pre-open some directories and then
change eUID (and all other UIDs/GIDs) to a less-privileged user,
retaining the ability to open/create files within these directories.
Design goal:
The idea is to provide very lightweight sandboxing, where the
process, without using any heavyweight techniques like chroot
within namespaces, can restrict its access to the set of pre-opened
directories.
This patch is just a first step to such sandboxing. If things go
well, in the future the same extension can be added to more syscalls.
These should include at least unlinkat(), renameat2() and the
not-yet-upstreamed setxattrat().
Security considerations:
To avoid sandbox escapes, this patch makes sure that one of the
restricted lookup modes is used, namely RESOLVE_BENEATH or
RESOLVE_IN_ROOT.
To avoid leaking creds across exec, this patch requires the O_CLOEXEC
flag on the directory.
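The two constraints above can be illustrated from userspace with the existing openat2() interface alone. This is only a sketch: the new flag introduced by this patch is deliberately not used (its name is not given in the description), and the helper names preopen_dir(), fd_is_cloexec() and open_beneath() are hypothetical, chosen for this example. It assumes Linux 5.6 or later with <linux/openat2.h> available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: pre-open a directory with close-on-exec set,
 * which is the property the patch requires of dir_fd. */
static int preopen_dir(const char *path)
{
	assert(path != NULL);
	return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Hypothetical helper: 1 if fd has FD_CLOEXEC, 0 if not, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int fl = fcntl(fd, F_GETFD);

	if (fl < 0)
		return -1;
	return (fl & FD_CLOEXEC) ? 1 : 0;
}

/* Hypothetical helper: open name read-only relative to dir_fd and
 * refuse any resolution that would leave the dir_fd subtree.
 * Returns a new fd, or -1 with errno set (EXDEV on an escape attempt,
 * ENOSYS if the kernel has no openat2()). */
static int open_beneath(int dir_fd, const char *name)
{
	struct open_how how;

	assert(name != NULL);
	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_BENEATH;	/* restricted lookup mode */
	return (int)syscall(SYS_openat2, dir_fd, name, &how, sizeof(how));
}
```

With a directory obtained from preopen_dir(), a call such as open_beneath(dir_fd, "../x") or open_beneath(dir_fd, "/etc") is expected to fail, since both paths resolve outside the pre-opened directory. RESOLVE_IN_ROOT could be substituted for RESOLVE_BENEATH to treat dir_fd as the root of the lookup instead of rejecting such paths outright.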
Use cases:
Virtual machines that deal with untrusted code can use this
instead of more heavyweight approaches.
Currently the approach is being tested on a dosemu2 VM.
Signed-off-by: Stas Sergeev <stsp2@xxxxxxxxx>
CC: Eric Biederman <ebiederm@xxxxxxxxxxxx>
CC: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
CC: Andy Lutomirski <luto@xxxxxxxxxx>
CC: Christian Brauner <brauner@xxxxxxxxxx>
CC: Jan Kara <jack@xxxxxxx>
CC: Jeff Layton <jlayton@xxxxxxxxxx>
CC: Chuck Lever <chuck.lever@xxxxxxxxxx>
CC: Alexander Aring <alex.aring@xxxxxxxxx>
CC: linux-fsdevel@xxxxxxxxxxxxxxx
CC: linux-kernel@xxxxxxxxxxxxxxx
CC: Paolo Bonzini <pbonzini@xxxxxxxxxx>
CC: Christian Göttsche <cgzones@xxxxxxxxxxxxxx>
---
fs/file_table.c | 2 ++
fs/internal.h | 2 +-
fs/namei.c | 54 ++++++++++++++++++++++++++++++++++--
fs/open.c | 2 +-
include/linux/fcntl.h | 2 ++
include/linux/fs.h | 2 ++
include/uapi/linux/openat2.h | 3 ++
7 files changed, 63 insertions(+), 4 deletions(-)
diff --git a/fs/file_table.c b/fs/file_table.c
index 4f03beed4737..9991bdd538e9 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -160,6 +160,8 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
mutex_init(&f->f_pos_lock);
f->f_flags = flags;
f->f_mode = OPEN_FMODE(flags);
+ f->f_fsuid = cred->fsuid;
+ f->f_fsgid = cred->fsgid;
/* f->f_version: 0 */
/*
diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..692b53b19aad 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -169,7 +169,7 @@ static inline void sb_end_ro_state_change(struct super_block *sb)
* open.c
*/
struct open_flags {
- int open_flag;
+ u64 open_flag;
umode_t mode;
int acc_mode;
int intent;
diff --git a/fs/namei.c b/fs/namei.c
index 2fde2c320ae9..d1db6ceee4bd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -586,6 +586,8 @@ struct nameidata {
int dfd;
vfsuid_t dir_vfsuid;
umode_t dir_mode;
+ kuid_t dir_open_fsuid;
+ kgid_t dir_open_fsgid;
} __randomize_layout;
#define ND_ROOT_PRESET 1
@@ -2414,6 +2416,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
+ nd->dir_open_fsuid = current_cred()->fsuid;
+ nd->dir_open_fsgid = current_cred()->fsgid;
I'm wondering if it would be better to capture the whole cred structure,
similar to io_register_personality(), which uses get_current_cred().
Using only the uid and gid won't reflect any group memberships or capabilities...
metze