Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Mon, 2 Jun 2014 17:29:46 -0700

On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>>> point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>>> installation time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>>
>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>>
>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>
>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>
>>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>>
>>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>>> filter being attached to it.
>>>>>>>>
>>>>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>>
>>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>>
>>>>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>>
>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>>> 3 could do:
>>>>>>>
>>>>>>> while (TSYNC-fails)
>>>>>>>    wait-on-or-kill-unexpected-thread
>>>>>>>
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>>>>> always-accept filter across all threads, although no one should really
>>>>>> do the latter for efficiency reasons.
>>>>>
>>>>> Given the complexity of the locking, "fail" means "I applied the
>>>>> change to all threads except for at least this one: *error code*",
>>>>> which means looping with the "add-a-filter" method means all the other
>>>>> threads keep getting filters added until there is full success. I
>>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>
>>>> Ugh, right.
>>>>
>>>>>
>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>>> used in can be very weird, I really want to retain the retry ability.
>>>>
>>>> OK.  So what's wrong with the other approach in which the
>>>> add-to-all-threads option always succeeds?  IOW, rather than requiring
>>>> that all threads have the caller's filter as an ancestor, just add the
>>>> requested filter to all threads.  As an optimization, if the targetted
>>>> thread has the current thread's filter as its filter, too, then the
>>>> targetted thread can stay synchronized.
>>>>
>>>> That way the add filter call really is atomic.
>>>>
>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>>> the userspace interface could be kept as simple as possible.  The fact
>>>> that there's a filter hierarchy is sort of an implementation detail, I
>>>> think.
>>>
>>> I'm totally on board with making this as simple as possible. :) The
>>> corner cases are kind of horrible, though, but I think this is already
>>> as simple as it can get.
>>>
>>> Externally, without the ancestry check, we can run the risk of have
>>> unstable behavior out of a filter change. Imagine the case of a race
>>> where a thread is adding a filter (via prctl), and the other thread
>>> attempts to TSYNC a filter that blocks prctl.
>>>
>>> In the "always take the new filter" case, sometimes we get two filters
>>> (original and TSYNCed) on the first thread, and sometimes it blows up
>>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>>> for the TSYNC caller to detect who won the race.
>>>
>>> With the patch series as-is, losing the race results in a TSYNC
>>> failure (ancestry doesn't match). This is immediately detectable and
>>> the caller can the decide how to handle the situation directly.
>>>
>>> Regardless, I don't think the filter hierarchy is an implementation
>>> detail -- complex filters take advantage of the hierarchies. And
>>> keeping this hierarchy stable means the filters are simpler to
>>> validate for correctness, etc.
>>
>> Hmm.
>>
>> Another issue: unless I'm not looking at the right version of this, I
>> don't think that while_each_thread does what you think it does.
>
> I believe it to walk current's thread_group list. It starts at current
> and walks until it sees current again. While holding task_list for
> writing, no new threads can appear on the list. Any cloned threads not
> yet in the thread_group list will be added once task_lock is held by
> the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
> pid);").

Oh.  I see.  I think the intended use is to pair do_each_thread with
while_each_thread, which has a rather different effect.
for_each_thread might be clearer.

>
>> I'm generally in favor of making the kernel interfaces easy to use,
>> even at the cost of a bit of implementation complexity.  Would this be
>> simpler if you hijacked siglock to protect seccomp state?  Then I
>> think you could just make everything atomic.
>
> The sighand structure isn't unique to the task nor the thread group
> (CLONE_SIGHAND will leave the same sighand attached to a new thread),
> so it can't be used sanely here AIUI. The threadgroup_lock stuff could
> be used here, but I felt like it was too heavy a hammer for
> per-add-filter actions.

I'm going by this comment:

/*
 * NOTE! "signal_struct" does not have its own
 * locking, because a shared signal_struct always
 * implies a shared sighand_struct, so locking
 * sighand_struct is always a proper superset of
 * the locking of signal_struct.
 */
struct signal_struct {

copy_process protects addition to signal->thread_head using siglock,
which seems like a suitable lock for you to use.  And it's there
regardless of configuration.

>
> The atomicity you're after is just avoiding the "I failed but some
> threads may have changed" side-effect, IIUC. As in, you want "I failed
> and no threads have changed". For that, I think we need to use
> threadgroup_lock. However, I remain unconvinced that this is the right
> way to go since the failure corner case needs to be handled by the
> tsync caller in one of two ways regardless of either perfect "failed,
> no threads changed" or existing series "failed, some threads changed":
> retry until success or kill entire process.

With the current patches, I think you need:

add filter to this thread;
while (trying to sync fails) {
  grumble;
  sleep a bit;
}

If the "failed, some threads changed" case goes away, then I think it can be:

while (trying to add a new filter to all threads fails) {
  grumble;
  sleep a bit;
}

which is nicer.  It could be fancier:

force add new filter to all threads;  (i.e. no loop)

where force-add does, roughly:

acquire siglock
old = current's filter
add the filter to this thread
new = current's new filter
for each other thread {
  if (thread's filter == old)
    thread's filter = new (w/ appropriate refcounting)
  else
    add filter to thread
}

IOW, if you are already adding filters to library threads, I think
it'll be more convenient and robust to to add it everywhere than to
exempt threads that have added their own filter.  If they've added
their own filter, we won't override it; we'll just compose.  It'll be
the same, modulo filter order, as if the library thread adds its
filter *after* we add ours.

Yes, I realize there are oddities here if one of the filters blocks
seccomp prctls, but I think there are oddities here regardless.  I
certainly see the value in the non-force variant.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html