[RFC,PATCH 2/2] Documentation: prctl/seccomp_filter

Will Drewry <wad@xxxxxxxxxxxx> · Wed, 11 Jan 2012 11:25:10 -0600

Document how system call filtering with BPF works
and how it can be used.

Signed-off-by: Will Drewry <wad@xxxxxxxxxxxx>
---
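Reviewer note, not part of the commit: the accept/reject convention that the
document below describes under PR_ATTACH_SECCOMP_FILTER (return the size of
the data to allow a call, 0 to block it) can be modelled in plain userspace C,
which may help when checking the BPF jump table in the Example section.  This
is only a sketch under stated assumptions: struct fake_regs, example_filter(),
syscall_allowed() and check_examples() are hypothetical names that exist
neither in the kernel nor in this series, the register layout is cut down to
two fields, and the system call numbers are the i386 ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the register data a filter sees. */
struct fake_regs {
	uint32_t ebx;      /* first syscall argument: the fd for read/write */
	uint32_t orig_eax; /* system call number */
};

/* i386 system call numbers used by the Example section. */
enum {
	NR_EXIT = 1, NR_READ = 3, NR_WRITE = 4,
	NR_SIGRETURN = 119, NR_RT_SIGRETURN = 173, NR_EXIT_GROUP = 252,
};

/*
 * Plain-C model of the example policy (an assumption-based sketch, not
 * kernel code): return the size of the data to allow the call, or 0 to
 * block it.
 */
uint32_t example_filter(const struct fake_regs *regs)
{
	const uint32_t allow = (uint32_t)sizeof(*regs);

	switch (regs->orig_eax) {
	case NR_RT_SIGRETURN:
	case NR_SIGRETURN:
	case NR_EXIT_GROUP:
	case NR_EXIT:
		return allow;
	case NR_READ:
		/* read is limited to stdin (fd 0) */
		return regs->ebx == 0 ? allow : 0;
	case NR_WRITE:
		/* write is limited to stdout (fd 1) and stderr (fd 2) */
		return (regs->ebx == 1 || regs->ebx == 2) ? allow : 0;
	default:
		return 0;
	}
}

/* Documented rule: only a return value equal to the data size allows. */
int syscall_allowed(uint32_t filter_ret, size_t data_len)
{
	return filter_ret == data_len;
}

/* Hypothetical self-check exercising a few representative cases. */
void check_examples(void)
{
	struct fake_regs rd_stdin = { .ebx = 0, .orig_eax = NR_READ };
	struct fake_regs rd_other = { .ebx = 5, .orig_eax = NR_READ };
	struct fake_regs wr_err   = { .ebx = 2, .orig_eax = NR_WRITE };
	struct fake_regs other    = { .ebx = 0, .orig_eax = 5 /* open */ };

	assert(syscall_allowed(example_filter(&rd_stdin), sizeof(rd_stdin)));
	assert(!syscall_allowed(example_filter(&rd_other), sizeof(rd_other)));
	assert(syscall_allowed(example_filter(&wr_err), sizeof(wr_err)));
	assert(!syscall_allowed(example_filter(&other), sizeof(other)));
}
```

Calling check_examples() from a test harness exercises the four cases.  The
reject paths return an explicit 0 instead of a leftover register value, which
matches the recommendation in the document's description of return values.
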
 Documentation/prctl/seccomp_filter.txt |  159 ++++++++++++++++++++++++++++++++
 1 files changed, 159 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..5fb3f44
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,159 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process,
+and many of them go unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefits from having a reduced
+set of available system calls.  The resulting set reduces the total kernel
+surface exposed to the application.  System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls.  The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct.  This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not set directly by the
+consuming process.  The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+	Allows the specification of a new filter using a BPF program.
+	The BPF program will be executed over a struct user_regs_struct
+	reflecting the register state at system call entry, except that the
+	system call number resides in orig_[register].  To allow a system
+	call, the size of the data must be returned.  At present, all other
+	return values result in the system call being blocked, but it is
+	recommended to return 0 in those cases.  This will allow for future
+	custom return values to be introduced, if ever desired.
+
+	Usage:
+		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+	The 'prog' argument is a pointer to a struct sock_fprog which will
+	contain the filter program.  If the program is invalid, the call
+	will return -1 and set errno to EINVAL.
+
+	The struct user_regs_struct the @prog will see is based on the
+	personality of the task at the time of this prctl call.
+	is_compat_task is also tracked for the @prog.  This means that, once
+	set, the calling task will have all of its system calls blocked if it
+	switches its system call ABI (via personality or other means).
+
+	If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+	namespace, the @prog will be marked as inheritable across execve.  Any
+	inherited filters are still subject to the system call ABI constraints
+	above and any ABI mismatched system calls will result in process death.
+
+The above call returns 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to read from stdin, write to stdout/stderr, and
+exit cleanly.  Without using a BPF compiler, this may be done as follows on
+x86 32-bit:
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+int install_filter(void)
+{
+	struct sock_filter filter[] = {
+		/* Grab the system call number */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+		/* Jump table for the allowed syscalls */
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 11, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 10, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 9, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 8, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 5),
+
+		/* Check that read is only using stdin. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 4, 3),
+
+		/* Check that write is only using stdout/stderr */
+		BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 2, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 1, 0),
+
+		/* Reject: return 0 rather than whatever is left in A */
+		BPF_STMT(BPF_RET+BPF_K, 0),
+
+		/* Accept: put the "accept" value (the data size) in A */
+		BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+		BPF_STMT(BPF_RET+BPF_A, 0),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	/* 36 stands in for PR_ATTACH_SECCOMP_FILTER in this example */
+	if (prctl(36, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+	char buf[4096];
+	ssize_t bytes = 0;
+	if (install_filter())
+		return 1;
+	syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+	syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+	return 0;
+}
+
+Additionally, if prctl(2) is allowed by the installed filter, further
+filters may be layered on.  Each added filter increases evaluation time, but
+allows the attack surface to be reduced further while the process executes.
+
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+  a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
-- 
1.7.5.4
