On 11/08/2020 21:58, Eric W. Biederman wrote: > Mickaël Salaün <mic@xxxxxxxxxxx> writes: > >> Allow for the enforcement of the O_MAYEXEC openat2(2) flag. Thanks to >> the noexec option from the underlying VFS mount, or to the file execute >> permission, userspace can enforce these execution policies. This may >> allow script interpreters to check execution permission before reading >> commands from a file, or dynamic linkers to allow shared object >> loading. > > Ick!!!!! > > This feels like being so open minded your brains fall out. Really? :) > > I can see having a sysctl that allows the new open flag to be ignored > so that the existing lack of enforcement when the flag is passed > continues. And that could break distros (i.e. multiple interpreters, with or without O_MAYEXEC, different versions, and different kernels). > > But having the sysctl be fine grained seems like way too much rope. This follow the same rational: file permissions and mount options can change but they are controled by the sysadmin, how also configure sysctl values. > > I don't think the code needs to do more than enforce or not enforce this > logic. I think this is the more viable behavior for an eclectic set of distros (and different use cases). > > You can test the sysctl once when you process O_MAYEXEC. But code such > as may_open should not have the conditional behavior. It should get an > appropriate set of flags that are always enforced. With the madness of > what to do left at the edge of userspace. The problem is that this userspace is not in charge of the system-wide policy, the sysadmin is. As pointed out by the commit message, please take a look at the rational: https://lore.kernel.org/lkml/1477d3d7-4b36-afad-7077-a38f42322238@xxxxxxxxxxx/ > > Anything else appears to be madness, overengineering, and a failure to > separate concerns. Oh! What a conclusion! :) I'd say it is a pragmatic approach. > > Eric > > >> Add a new sysctl fs.open_mayexec_enforce to enable system administrators >> to enforce two complementary security policies according to the >> installed system: enforce the noexec mount option, and enforce >> executable file permission. Indeed, because of compatibility with >> installed systems, only system administrators are able to check that >> this new enforcement is in line with the system mount points and file >> permissions. A following patch adds documentation. >> >> Being able to restrict execution also enables to protect the kernel by >> restricting arbitrary syscalls that an attacker could perform with a >> crafted binary or certain script languages. It also improves multilevel >> isolation by reducing the ability of an attacker to use side channels >> with specific code. These restrictions can natively be enforced for ELF >> binaries (with the noexec mount option) but require this kernel >> extension to properly handle scripts (e.g., Python, Perl). To get a >> consistent execution policy, additional memory restrictions should also >> be enforced (e.g. thanks to SELinux). >> >> Because the O_MAYEXEC flag is a meant to enforce a system-wide security >> policy (but not application-centric policies), it does not make sense >> for userland to check the sysctl value. Indeed, this new flag only >> enables to extend the system ability to enforce a policy thanks to (some >> trusted) userland collaboration. Moreover, additional security policies >> could be managed by LSMs. This is a best-effort approach from the >> application developer point of view: >> https://lore.kernel.org/lkml/1477d3d7-4b36-afad-7077-a38f42322238@xxxxxxxxxxx/ >> >> Signed-off-by: Mickaël Salaün <mic@xxxxxxxxxxx> >> Reviewed-by: Thibaut Sautereau <thibaut.sautereau@xxxxxxxxxxx> >> Cc: Aleksa Sarai <cyphar@xxxxxxxxxx> >> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> >> Cc: Jonathan Corbet <corbet@xxxxxxx> >> Cc: Kees Cook <keescook@xxxxxxxxxxxx> >> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx> >> --- >> >> Changes since v6: >> * Allow opening pipes, block devices and character devices with >> O_MAYEXEC when there is no enforced policy, but forbid any non-regular >> file opened with O_MAYEXEC otherwise (i.e. for any enforced policy). >> * Add a paragraph about the non-regular files policy. >> * Move path_noexec() calls out of the fast-path (suggested by Kees >> Cook). >> >> Changes since v5: >> * Remove the static enforcement configuration through Kconfig because it >> makes the code more simple like this, and because the current sysctl >> configuration can only be set with CAP_SYS_ADMIN, the same way mount >> options (i.e. noexec) can be set. If an harden distro wants to >> enforce a configuration, it should restrict capabilities or sysctl >> configuration. Furthermore, an LSM can easily leverage O_MAYEXEC to >> fit its need. >> * Move checks from inode_permission() to may_open() and make the error >> codes more consistent according to file types (in line with a previous >> commit): opening a directory with O_MAYEXEC returns EISDIR and other >> non-regular file types may return EACCES. >> * In may_open(), when OMAYEXEC_ENFORCE_FILE is set, replace explicit >> call to generic_permission() with an artificial MAY_EXEC to avoid >> double calls. This makes sense especially when an LSM policy forbids >> execution of a file. >> * Replace the custom proc_omayexec() with >> proc_dointvec_minmax_sysadmin(), and then replace the CAP_MAC_ADMIN >> check with a CAP_SYS_ADMIN one (suggested by Kees Cook and Stephen >> Smalley). >> * Use BIT() (suggested by Kees Cook). >> * Rename variables (suggested by Kees Cook). >> * Reword the kconfig help. >> * Import the documentation patch (suggested by Kees Cook): >> https://lore.kernel.org/lkml/20200505153156.925111-6-mic@xxxxxxxxxxx/ >> * Update documentation and add LWN.net article. >> >> Changes since v4: >> * Add kernel configuration options to enforce O_MAYEXEC at build time, >> and disable the sysctl in such case (requested by James Morris). >> * Reword commit message. >> >> Changes since v3: >> * Update comment with O_MAYEXEC. >> >> Changes since v2: >> * Cosmetic changes. >> >> Changes since v1: >> * Move code from Yama to the FS subsystem (suggested by Kees Cook). >> * Make omayexec_inode_permission() static (suggested by Jann Horn). >> * Use mode 0600 for the sysctl. >> * Only match regular files (not directories nor other types), which >> follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow >> opening only regular files during execve()"). >> --- >> Documentation/admin-guide/sysctl/fs.rst | 49 +++++++++++++++++++++++++ >> fs/namei.c | 24 ++++++++++++ >> include/linux/fs.h | 1 + >> kernel/sysctl.c | 12 +++++- >> 4 files changed, 84 insertions(+), 2 deletions(-) >> >> diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst >> index 2a45119e3331..ce6e2081d3a9 100644 >> --- a/Documentation/admin-guide/sysctl/fs.rst >> +++ b/Documentation/admin-guide/sysctl/fs.rst >> @@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/fs: >> - inode-nr >> - inode-state >> - nr_open >> +- open_mayexec_enforce >> - overflowuid >> - overflowgid >> - pipe-user-pages-hard >> @@ -165,6 +166,54 @@ system needs to prune the inode list instead of allocating >> more. >> >> >> +open_mayexec_enforce >> +-------------------- >> + >> +While being ignored by :manpage:`open(2)` and :manpage:`openat(2)`, the >> +``O_MAYEXEC`` flag can be passed to :manpage:`openat2(2)` to only open regular >> +files that are expected to be executable. If the file is not identified as >> +executable, then the syscall returns -EACCES. This may allow a script >> +interpreter to check executable permission before reading commands from a file, >> +or a dynamic linker to only load executable shared objects. One interesting >> +use case is to enforce a "write xor execute" policy through interpreters. >> + >> +The ability to restrict code execution must be thought as a system-wide policy, >> +which first starts by restricting mount points with the ``noexec`` option. >> +This option is also automatically applied to special filesystems such as /proc . >> +This prevents files on such mount points to be directly executed by the kernel >> +or mapped as executable memory (e.g. libraries). With script interpreters >> +using the ``O_MAYEXEC`` flag, the executable permission can then be checked >> +before reading commands from files. This makes it possible to enforce the >> +``noexec`` at the interpreter level, and thus propagates this security policy >> +to scripts. To be fully effective, these interpreters also need to handle the >> +other ways to execute code: command line parameters (e.g., option ``-e`` for >> +Perl), module loading (e.g., option ``-m`` for Python), stdin, file sourcing, >> +environment variables, configuration files, etc. According to the threat >> +model, it may be acceptable to allow some script interpreters (e.g. Bash) to >> +interpret commands from stdin, may it be a TTY or a pipe, because it may not be >> +enough to (directly) perform syscalls. >> + >> +There are two complementary security policies: enforce the ``noexec`` mount >> +option, and enforce executable file permission. These policies are handled by >> +the ``fs.open_mayexec_enforce`` sysctl (writable only with ``CAP_SYS_ADMIN``) >> +as a bitmask: >> + >> +1 - Mount restriction: checks that the mount options for the underlying VFS >> + mount do not prevent execution. >> + >> +2 - File permission restriction: checks that the to-be-opened file is marked as >> + executable for the current process (e.g., POSIX permissions). >> + >> +Note that as long as a policy is enforced, opening any non-regular file with >> +``O_MAYEXEC`` is denied (e.g. TTYs, pipe), even when such a file is marked as >> +executable or is on an executable mount point. >> + >> +Code samples can be found in tools/testing/selftests/openat2/omayexec_test.c >> +and interpreter patches (for the original O_MAYEXEC version) may be found at >> +https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC . >> +See also an overview article: https://lwn.net/Articles/820000/ . >> + >> + >> overflowgid & overflowuid >> ------------------------- >> >> diff --git a/fs/namei.c b/fs/namei.c >> index 3f074ec77390..8ec13c7fd403 100644 >> --- a/fs/namei.c >> +++ b/fs/namei.c >> @@ -39,6 +39,7 @@ >> #include <linux/bitops.h> >> #include <linux/init_task.h> >> #include <linux/uaccess.h> >> +#include <linux/sysctl.h> >> >> #include "internal.h" >> #include "mount.h" >> @@ -425,6 +426,11 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) >> return 0; >> } >> >> +#define OPEN_MAYEXEC_ENFORCE_MOUNT BIT(0) >> +#define OPEN_MAYEXEC_ENFORCE_FILE BIT(1) >> + >> +int sysctl_open_mayexec_enforce __read_mostly; >> + >> /** >> * inode_permission - Check for access rights to a given inode >> * @inode: Inode to check permission on >> @@ -2861,11 +2867,29 @@ static int may_open(const struct path *path, int acc_mode, int flag) >> case S_IFSOCK: >> if (acc_mode & MAY_EXEC) >> return -EACCES; >> + /* >> + * Opening devices (e.g. TTYs) or pipes with O_MAYEXEC may be >> + * legitimate when there is no enforced policy. >> + */ >> + if ((acc_mode & MAY_OPENEXEC) && sysctl_open_mayexec_enforce) >> + return -EACCES; >> flag &= ~O_TRUNC; >> break; >> case S_IFREG: >> if ((acc_mode & MAY_EXEC) && path_noexec(path)) >> return -EACCES; >> + if (acc_mode & MAY_OPENEXEC) { >> + if ((sysctl_open_mayexec_enforce & OPEN_MAYEXEC_ENFORCE_MOUNT) >> + && path_noexec(path)) >> + return -EACCES; >> + if (sysctl_open_mayexec_enforce & OPEN_MAYEXEC_ENFORCE_FILE) >> + /* >> + * Because acc_mode may change here, the next and only >> + * use of acc_mode should then be by the following call >> + * to inode_permission(). >> + */ >> + acc_mode |= MAY_EXEC; >> + } >> break; >> } >> >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index 56f835c9a87a..071f37707ccc 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -83,6 +83,7 @@ extern int sysctl_protected_symlinks; >> extern int sysctl_protected_hardlinks; >> extern int sysctl_protected_fifos; >> extern int sysctl_protected_regular; >> +extern int sysctl_open_mayexec_enforce; >> >> typedef __kernel_rwf_t rwf_t; >> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c >> index db1ce7af2563..5008a2566e79 100644 >> --- a/kernel/sysctl.c >> +++ b/kernel/sysctl.c >> @@ -113,6 +113,7 @@ static int sixty = 60; >> >> static int __maybe_unused neg_one = -1; >> static int __maybe_unused two = 2; >> +static int __maybe_unused three = 3; >> static int __maybe_unused four = 4; >> static unsigned long zero_ul; >> static unsigned long one_ul = 1; >> @@ -888,7 +889,6 @@ static int proc_taint(struct ctl_table *table, int write, >> return err; >> } >> >> -#ifdef CONFIG_PRINTK >> static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write, >> void *buffer, size_t *lenp, loff_t *ppos) >> { >> @@ -897,7 +897,6 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write, >> >> return proc_dointvec_minmax(table, write, buffer, lenp, ppos); >> } >> -#endif >> >> /** >> * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure >> @@ -3264,6 +3263,15 @@ static struct ctl_table fs_table[] = { >> .extra1 = SYSCTL_ZERO, >> .extra2 = &two, >> }, >> + { >> + .procname = "open_mayexec_enforce", >> + .data = &sysctl_open_mayexec_enforce, >> + .maxlen = sizeof(int), >> + .mode = 0600, >> + .proc_handler = proc_dointvec_minmax_sysadmin, >> + .extra1 = SYSCTL_ZERO, >> + .extra2 = &three, >> + }, >> #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE) >> { >> .procname = "binfmt_misc",