On Mon, Jun 1, 2015 at 12:47 PM, Tycho Andersen <tycho.andersen@xxxxxxxxxxxxx> wrote: > On Mon, Jun 01, 2015 at 12:38:57PM -0700, Andy Lutomirski wrote: >> On Mon, Jun 1, 2015 at 12:28 PM, Tycho Andersen >> <tycho.andersen@xxxxxxxxxxxxx> wrote: >> > This patch is the first step in enabling checkpoint/restore of processes >> > with seccomp enabled. >> > >> > One of the things CRIU does while dumping tasks is inject code into them >> > via ptrace to collect information that is only available to the process >> > itself. However, if we are in a seccomp mode where these processes are >> > prohibited from making these syscalls, then what CRIU does kills the task. >> > >> > This patch adds a new ptrace command, PTRACE_SUSPEND_SECCOMP that enables a >> > task from the init user namespace which has CAP_SYS_ADMIN to disable (and >> > re-enable) seccomp filters for another task so that they can be >> > successfully dumped (and restored). >> > >> > Signed-off-by: Tycho Andersen <tycho.andersen@xxxxxxxxxxxxx> >> > CC: Kees Cook <keescook@xxxxxxxxxxxx> >> > CC: Andy Lutomirski <luto@xxxxxxxxxxxxxx> >> > CC: Will Drewry <wad@xxxxxxxxxxxx> >> > CC: Roland McGrath <roland@xxxxxxxxxxxxx> >> > CC: Oleg Nesterov <oleg@xxxxxxxxxx> >> > CC: Pavel Emelyanov <xemul@xxxxxxxxxxxxx> >> > CC: Serge E. Hallyn <serge.hallyn@xxxxxxxxxx> >> > --- >> > include/linux/seccomp.h | 8 ++++++ >> > include/uapi/linux/ptrace.h | 1 + >> > kernel/ptrace.c | 10 ++++++++ >> > kernel/seccomp.c | 62 ++++++++++++++++++++++++++++++++++++++++++++- >> > 4 files changed, 80 insertions(+), 1 deletion(-) >> > >> > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >> > index a19ddac..7cc870f 100644 >> > --- a/include/linux/seccomp.h >> > +++ b/include/linux/seccomp.h >> > @@ -25,6 +25,9 @@ struct seccomp_filter; >> > struct seccomp { >> > int mode; >> > struct seccomp_filter *filter; >> > +#ifdef CONFIG_CHECKPOINT_RESTORE >> > + bool suspended; >> > +#endif >> > }; >> > >> > #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER >> > @@ -53,6 +56,11 @@ static inline int seccomp_mode(struct seccomp *s) >> > return s->mode; >> > } >> > >> > +#ifdef CONFIG_CHECKPOINT_RESTORE >> > +extern int suspend_seccomp(struct task_struct *); >> > +extern int resume_seccomp(struct task_struct *); >> > +#endif >> > + >> > #else /* CONFIG_SECCOMP */ >> > >> > #include <linux/errno.h> >> > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h >> > index cf1019e..8ba4e4f 100644 >> > --- a/include/uapi/linux/ptrace.h >> > +++ b/include/uapi/linux/ptrace.h >> > @@ -17,6 +17,7 @@ >> > #define PTRACE_CONT 7 >> > #define PTRACE_KILL 8 >> > #define PTRACE_SINGLESTEP 9 >> > +#define PTRACE_SUSPEND_SECCOMP 10 >> > >> > #define PTRACE_ATTACH 16 >> > #define PTRACE_DETACH 17 >> > diff --git a/kernel/ptrace.c b/kernel/ptrace.c >> > index c8e0e05..a6b6527 100644 >> > --- a/kernel/ptrace.c >> > +++ b/kernel/ptrace.c >> > @@ -15,6 +15,7 @@ >> > #include <linux/highmem.h> >> > #include <linux/pagemap.h> >> > #include <linux/ptrace.h> >> > +#include <linux/seccomp.h> >> > #include <linux/security.h> >> > #include <linux/signal.h> >> > #include <linux/uio.h> >> > @@ -1003,6 +1004,15 @@ int ptrace_request(struct task_struct *child, long request, >> > break; >> > } >> > #endif >> > + >> > +#if defined(CONFIG_SECCOMP) && defined(CONFIG_CHECKPOINT_RESTORE) >> > + case PTRACE_SUSPEND_SECCOMP: >> > + if (data) >> > + return suspend_seccomp(child); >> > + else >> > + return resume_seccomp(child); >> > +#endif >> > + >> > default: >> > break; >> > } >> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> > index 980fd26..a358a58 100644 >> > --- a/kernel/seccomp.c >> > +++ b/kernel/seccomp.c >> > @@ -569,6 +569,7 @@ static int mode1_syscalls_32[] = { >> > static void __secure_computing_strict(int this_syscall) >> > { >> > int *syscall_whitelist = mode1_syscalls; >> > + >> > #ifdef CONFIG_COMPAT >> > if (is_compat_task()) >> > syscall_whitelist = mode1_syscalls_32; >> > @@ -590,6 +591,11 @@ void secure_computing_strict(int this_syscall) >> > { >> > int mode = current->seccomp.mode; >> > >> > +#ifdef CONFIG_CHECKPOINT_RESTORE >> > + if (current->seccomp.suspended) >> > + return; >> > +#endif >> >> IMO it's unfortunate that this has any runtime overhead at all. Can >> it be suspend be multiplexed into mode to avoid this? >> >> > >> > +#ifdef CONFIG_CHECKPOINT_RESTORE >> > + if (unlikely(current->seccomp.suspended)) >> > + return SECCOMP_PHASE1_OK; >> > +#endif >> > + >> >> Ditto. >> >> > @@ -769,7 +780,8 @@ static long seccomp_set_mode_strict(void) >> > goto out; >> > >> > #ifdef TIF_NOTSC >> > - disable_TSC(); >> > + if (!current->seccomp.suspended) >> > + disable_TSC(); >> > #endif >> >> If you get here with seccomp suspended, you have already utterly >> failed. Please don't try to pretend that you're going to do something >> sensible if this happens. >> >> CRIU needs to freeze the target task reliably before suspending >> seccomp to avoid massive security holes. > > Sorry, I should probably have mentioned this. It actually useful on > restore: the problem is that CRIU maps its restorer code into memory, > does all of the necessary work to restore (including restore the > seccomp stuff), and then tries to unmap that blob and gets killed. > > So, the criu restore procedure is to suspend seccomp, restore the > seccomp state, and then let criu resume the seccomp state just before > the final task resume. > >> > seccomp_assign_mode(current, seccomp_mode); >> > ret = 0; >> > @@ -901,3 +913,51 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) >> > /* prctl interface doesn't have flags, so they are always zero. */ >> > return do_seccomp(op, 0, uargs); >> > } >> > + >> > +#ifdef CONFIG_CHECKPOINT_RESTORE >> > +int suspend_seccomp(struct task_struct *task) >> > +{ >> > + int ret = -EACCES; >> > + >> > + spin_lock_irq(&task->sighand->siglock); >> > + >> > + if (!capable(CAP_SYS_ADMIN)) >> > + goto out; >> > + >> > + task->seccomp.suspended = true; >> > + >> > +#ifdef TIF_NOTSC >> > + if (task->seccomp.mode == SECCOMP_MODE_STRICT) >> > + clear_tsk_thread_flag(task, TIF_NOTSC); >> > +#endif >> >> And what if the task is running? >> >> > + >> > + ret = 0; >> > +out: >> > + spin_unlock_irq(&task->sighand->siglock); >> > + >> > + return ret; >> > +} >> > + >> > +int resume_seccomp(struct task_struct *task) >> > +{ >> > + int ret = -EACCES; >> > + >> > + spin_lock_irq(&task->sighand->siglock); >> > + >> > + if (!capable(CAP_SYS_ADMIN)) >> > + goto out; >> > + >> > + task->seccomp.suspended = false; >> > + >> > +#ifdef TIF_NOTSC >> > + if (task->seccomp.mode == SECCOMP_MODE_STRICT) >> > + set_tsk_thread_flag(task, TIF_NOTSC); >> > +#endif >> >> Ditto. Or can the task not be running here? > > It is stopped since ptrace requires it to be stopped; I don't know if > that's enough to guarantee correctness, though. Is there some > additional barrier that is needed? Dunno. Does ptrace actually guarantee that for new operations? In any event, I'm not sure I see why you need to manipulate NOTSC like this at all. What goes wrong if you let the NOTSC take effect immediately? --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html