On Mon, Jun 2, 2014 at 6:09 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> On Mon, Jun 2, 2014 at 5:29 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse codebases often requires handling threads which may be started early in the process lifetime (e.g., by code that is linked in). While it is possible to apply permissive programs prior to process start up, it is difficult to further restrict the kernel ABI to those threads after that point.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread group seccomp filters and a prctl() for accessing that functionality, as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter installation time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER, flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it will attempt to synchronize all threads in current's threadgroup to its seccomp filter program. This is possible iff all threads are using a filter that is an ancestor of the filter current is attempting to synchronize to. NULL filters (where the task is running as SECCOMP_MODE_NONE) are also treated as ancestors, allowing threads to be transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS, ...) has been set on the calling thread, no_new_privs will be set for all synchronized threads too. On success, 0 is returned. On failure, the pid of one of the failing threads will be returned, with as many filters installed as possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters being separate operations? If not, I think this would be easier to understand and to use if there was just a single operation.
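For readers skimming the thread, here is a rough userspace sketch of the two call forms the changelog above describes. The PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER, SECCOMP_EXT_ACT_TSYNC, and SECCOMP_FILTER_TSYNC constants exist only in this unmerged patch series, so the numeric values below are placeholders, and the BPF program is just an allow-everything stand-in for a real policy.

/* Illustrative sketch only: the constants below stand in for values
 * defined by the proposed (unmerged) patch series, not a released ABI. */
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define PR_SECCOMP_EXT          43  /* placeholder value */
#define SECCOMP_EXT_ACT          1  /* placeholder value */
#define SECCOMP_EXT_ACT_FILTER   1  /* placeholder value */
#define SECCOMP_EXT_ACT_TSYNC    2  /* placeholder value */
#define SECCOMP_FILTER_TSYNC     1  /* placeholder flag value */

int main(void)
{
	/* Allow-everything program, standing in for a real policy. */
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	long ret;

	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

	/* Form 1: install a filter and sync all threads in one call. */
	ret = prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
		    SECCOMP_FILTER_TSYNC, &prog);
	if (ret)
		fprintf(stderr, "install+tsync failed (tid %ld diverged)\n", ret);

	/* Form 2: only synchronize other threads to our current filter. */
	ret = prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0);
	if (ret)
		fprintf(stderr, "tsync failed (tid %ld diverged)\n", ret);

	return 0;
}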
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good to be able to have a distinct interface to retry the thread sync that doesn't require adding "no-op" filters.
>>>>>>>>>>>>
>>>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>>>
>>>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>>>
>>>>>>>>>>>> The idea would be that, if seccomp_add_filter fails, then you give up and, if it succeeds, then you're done. It shouldn't fail unless you're out of memory or you've nested too deeply.
>>>>>>>>>>>
>>>>>>>>>>> I wanted to keep the case of being able to wait for non-ancestor threads to finish. For example, 2 threads start and set separate filters. 1 does work and exits, 2 starts another thread (3) which adds filters, does work, and then waits for 1 to finish by calling TSYNC. Once 1 dies, TSYNC succeeds. In the case of not having direct control over thread lifetime (say, when using third-party libraries), I'd like to retain the flexibility of being able to do TSYNC without needing a filter attached to it.
>>>>>>>>>>
>>>>>>>>>> I must admit this strikes me as odd. What's the point of having a thread set a filter if it intends to be a short-lived thread?
>>>>>>>>>
>>>>>>>>> I was illustrating the potential insanity of third-party libraries. There isn't much sense in that behavior, but if it exists, working around it is harder without the separate TSYNC-only call.
>>>>>>>>>
>>>>>>>>>> In any case, I must have missed the ability for TSYNC to block. Hmm. That seems complicated, albeit potentially useful.
>>>>>>>>>
>>>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread 3 could do:
>>>>>>>>>
>>>>>>>>>     while (TSYNC-fails)
>>>>>>>>>         wait-on-or-kill-unexpected-thread
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ok.
>>>>>>>>
>>>>>>>> I'm still not seeing the need for a separate TSYNC option, though -- just add-a-filter-across-all-threads would work if it failed harmlessly on error. FWIW, TSYNC is probably equivalent to adding an always-accept filter across all threads, although no one should really do the latter for efficiency reasons.
>>>>>>>
>>>>>>> Given the complexity of the locking, "fail" means "I applied the change to all threads except for at least this one: *error code*", so looping with the "add-a-filter" method means all the other threads keep getting filters added until there is full success. I don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>>>
>>>>>> Ugh, right.
>>>>>>
>>>>>>>
>>>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not sensible to continue after a failure. However, using just-TSYNC allows sensible re-trying. Since the environments where TSYNC is intended to be used can be very weird, I really want to retain the retry ability.
>>>>>>
>>>>>> OK. So what's wrong with the other approach in which the add-to-all-threads option always succeeds? IOW, rather than requiring that all threads have the caller's filter as an ancestor, just add the requested filter to all threads. As an optimization, if the targeted thread has the current thread's filter as its filter, too, then the targeted thread can stay synchronized.
>>>>>>
>>>>>> That way the add filter call really is atomic.
>>>>>>
>>>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if the userspace interface could be kept as simple as possible. The fact that there's a filter hierarchy is sort of an implementation detail, I think.
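As a concrete reading of the retry pattern Kees sketches above ("while (TSYNC-fails) wait-on-or-kill-unexpected-thread"), a thread that cannot control a stray filtered thread's lifetime might loop on the standalone TSYNC action until that thread goes away. As before, the constant values are placeholders for the unmerged series, and the failing-TID return convention is the one described in the changelog.

/* Hypothetical retry loop around the standalone TSYNC action; the
 * constant values are placeholders for this proposed patch series. */
#include <time.h>
#include <sys/prctl.h>

#define PR_SECCOMP_EXT         43  /* placeholder value */
#define SECCOMP_EXT_ACT         1  /* placeholder value */
#define SECCOMP_EXT_ACT_TSYNC   2  /* placeholder value */

static int sync_thread_filters(void)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	long ret;

	for (;;) {
		ret = prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT,
			    SECCOMP_EXT_ACT_TSYNC, 0, 0);
		if (ret == 0)
			return 0;	/* every thread now shares our filter */
		if (ret < 0)
			return -1;	/* e.g. interface not supported */
		/* ret is the TID of a thread with a non-ancestor filter;
		 * wait for it to exit (or make it exit) and retry. */
		nanosleep(&delay, NULL);
	}
}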
>>>>>
>>>>> I'm totally on board with making this as simple as possible. :) The corner cases are kind of horrible, though, but I think this is already as simple as it can get.
>>>>>
>>>>> Externally, without the ancestry check, we can run the risk of having unstable behavior out of a filter change. Imagine the case of a race where a thread is adding a filter (via prctl), and the other thread attempts to TSYNC a filter that blocks prctl.
>>>>>
>>>>> In the "always take the new filter" case, sometimes we get two filters (original and TSYNCed) on the first thread, and sometimes it blows up when it calls prctl (TSYNCed filter blocks the prctl). There's no way for the TSYNC caller to detect who won the race.
>>>>>
>>>>> With the patch series as-is, losing the race results in a TSYNC failure (ancestry doesn't match). This is immediately detectable and the caller can then decide how to handle the situation directly.
>>>>>
>>>>> Regardless, I don't think the filter hierarchy is an implementation detail -- complex filters take advantage of the hierarchies. And keeping this hierarchy stable means the filters are simpler to validate for correctness, etc.
>>>>
>>>> Hmm.
>>>>
>>>> Another issue: unless I'm not looking at the right version of this, I don't think that while_each_thread does what you think it does.
>>>
>>> I believe it to walk current's thread_group list. It starts at current and walks until it sees current again. While holding tasklist_lock for writing, no new threads can appear on the list. Any cloned threads not yet in the thread_group list will be added once tasklist_lock is held by the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID, pid);").
>>
>> Oh. I see. I think the intended use is to pair do_each_thread with while_each_thread, which has a rather different effect. for_each_thread might be clearer.
>
> I can swap it out. It doesn't seem to be any different.
>
>>>> I'm generally in favor of making the kernel interfaces easy to use, even at the cost of a bit of implementation complexity. Would this be simpler if you hijacked siglock to protect seccomp state? Then I think you could just make everything atomic.
>>>
>>> The sighand structure isn't unique to the task nor the thread group (CLONE_SIGHAND will leave the same sighand attached to a new thread), so it can't be used sanely here AIUI. The threadgroup_lock stuff could be used here, but I felt like it was too heavy a hammer for per-add-filter actions.
>>
>> I'm going by this comment:
>>
>> /*
>>  * NOTE! "signal_struct" does not have its own
>>  * locking, because a shared signal_struct always
>>  * implies a shared sighand_struct, so locking
>>  * sighand_struct is always a proper superset of
>>  * the locking of signal_struct.
>>  */
>> struct signal_struct {
>>
>> copy_process protects addition to signal->thread_head using siglock, which seems like a suitable lock for you to use. And it's there regardless of configuration.
>
> Ah, I see now that CLONE_THREAD must include CLONE_SIGHAND, and CLONE_SIGHAND must include CLONE_VM. So, we can depend on this being a single thread-group-wide lock! Excellent.
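To make the locking conclusion above concrete, here is a kernel-context sketch (not code from the posted series) of walking the caller's thread group under sighand->siglock with for_each_thread(), relying on the fact that copy_process() links new threads into signal->thread_head while holding that same lock.

/* Kernel-context sketch only, not the patch's actual implementation.
 * Because CLONE_THREAD implies CLONE_SIGHAND, every thread in the group
 * shares one sighand_struct, so its siglock serializes this walk against
 * copy_process() adding new threads to signal->thread_head. */
#include <linux/sched.h>
#include <linux/spinlock.h>

static void seccomp_visit_threads_sketch(void)
{
	struct task_struct *caller = current;
	struct task_struct *thread;

	spin_lock_irq(&caller->sighand->siglock);
	for_each_thread(caller, thread) {
		if (thread == caller)
			continue;
		/* Inspect or update thread->seccomp here. No sleeping
		 * allocations (GFP_KERNEL) are allowed while this spinlock
		 * is held, which is the constraint Kees raises below about
		 * adding filters per thread. */
	}
	spin_unlock_irq(&caller->sighand->siglock);
}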
>
>>> The atomicity you're after is just avoiding the "I failed but some threads may have changed" side-effect, IIUC. As in, you want "I failed and no threads have changed". For that, I think we need to use threadgroup_lock. However, I remain unconvinced that this is the right way to go, since the failure corner case needs to be handled by the tsync caller in one of two ways regardless of either perfect "failed, no threads changed" or existing series "failed, some threads changed": retry until success or kill the entire process.
>>
>> With the current patches, I think you need:
>>
>>     add filter to this thread;
>>     while (trying to sync fails) {
>>         grumble;
>>         sleep a bit;
>>     }
>>
>> If the "failed, some threads changed" case goes away, then I think it can be:
>>
>>     while (trying to add a new filter to all threads fails) {
>>         grumble;
>>         sleep a bit;
>>     }
>
> You're still looking to get rid of the standalone tsync action. It does seem unneeded here if the filter-add can fail without side-effects. A new prctl or syscall is still needed to handle this, regardless.
>
>> which is nicer. It could be fancier:
>>
>>     force add new filter to all threads;  (i.e. no loop)
>>
>> where force-add does, roughly:
>>
>>     acquire siglock
>>     old = current's filter
>>     add the filter to this thread
>>     new = current's new filter
>>     for each other thread {
>>         if (thread's filter == old)
>>             thread's filter = new (w/ appropriate refcounting)
>>         else
>>             add filter to thread
>>     }
>
> The trick is the "w/ appropriate refcounting" part. :) Also, "add filter to thread" means performing memory allocation, which we can't do while holding the siglock spinlock IIUC. I remain uncomfortable with the composition idea, because it's not really a "sync", it's a "make sure *this* filter is attached" instead of "make sure everyone is using my entire current filter chain".
>
>> IOW, if you are already adding filters to library threads, I think it'll be more convenient and robust to add it everywhere than to exempt threads that have added their own filter. If they've added their own filter, we won't override it; we'll just compose. It'll be the same, modulo filter order, as if the library thread adds its filter *after* we add ours.
>
> This is why I don't like it: it's not sync. It's force-apply. The goal was to be able to build up a filter tree with potentially multiple calls, and in the future say "okay, now, make sure any threads that got started in the middle of that are all moved forward to the same filter chain position I'm on."
>
> The force-apply route means all the filters must be force-loaded, and doesn't handle, say, a library that doesn't start a thread but does apply a filter while running; then we apply another with force-apply, and suddenly all the other threads are missing the filter added by the library.

This is racy, though: what if you sync before the library adds its filter?

>
> I think sync should be sync -- it's the desired functionality that Chrome needs. Force-apply is different, and could be implemented at a later time if someone wants it, but for now, I'd like a literal thread-sync.
>
>> Yes, I realize there are oddities here if one of the filters blocks seccomp prctls, but I think there are oddities here regardless. I certainly see the value in the non-force variant.
>
> So, it sounds like if I can build the side-effect-less failure case, you'll be happy? I really do not want to do the force-apply. Can you live without that?

Sure. I can always add it if I find myself wanting it. Are you planning on changing the locking and making sync (or add-to-all-threads-not-forced) atomic?

--Andy