On Thu, May 19, 2011 at 4:05 PM, Will Drewry <wad@xxxxxxxxxxxx> wrote:
> On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>> On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
>>
>>> Do event_* that return non-void exist in the tree at all now? I've looked at the various tracepoint macros as well as some of the other handlers (trace_function, perf_tp_event, etc.) and I'm not seeing any places where a return value is honored or could be. At best, perf_tp_event can be short-circuited in the hlist_for_each, but it'd still need a way to bubble up a failure and result in not calling the trace/event that the hook precedes.
>>
>> No, none of the current trace hooks have return values. That was what I was talking about implementing in my previous emails.
>
> Led on by complete ignorance, I think I'm finally beginning to untangle the different pieces of the tracing infrastructure. Unfortunately, that means I took a few wrong turns along the way...
>
> I think function tracing looks something like this:
>
> ftrace_call has been injected at a specific callsite. Upon hit:
> 1. ftrace_call triggers
> 2. it does some checks, then calls ftrace_trace_function (via mcount instrs)
> 3. ftrace_trace_function may be a single func or a list; for a list it will be ftrace_list_func
> 4. ftrace_list_func calls each registered hook for that function in a while loop, ignoring return values
> 5. registered hook funcs may then track the call, farm it out to specific sub-handlers, etc.
>
> This seems to be a red herring for my use case :/ though this helped me understand your back and forth (Ingo & Steve) regarding dynamic versus explicit events.
>
> System call tracing is done via kernel/tracepoint.c events fed in from arch/[arch]/kernel/ptrace.c, where it calls trace_sys_enter. This yields direct sys_enter and sys_exit event sources (and an event class to hook up per-system-call events). This means that ftrace_syscall_enter() does the event prep prior to doing a filter check comparing the ftrace_event object for the given syscall_nr to the event data. perf_syscall_enter() is similar, but it pushes the info over to perf_tp_event to be matched not against the global syscall event entry, but against any sub-event in the linked list on that syscall's event.
>
> Is that roughly an accurate representation of the two? I wish I hadn't gotten distracted along the function path, but at least I learned something (and it is relevant to the larger scope of this thread's discussion).
>
> After doing that digging, it looks like providing hook call pre-emption and return value propagation will be a unique and fun task for each tracer and event subsystem. If I look solely at tracepoints, a generic change would be to make the trace_##subsys function return an int (which I think was the event_vfs_getname proposal?). The other option is to change the trace_sys_enter proto to include an 'int *retcode'.
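
To make that second option a bit more concrete: the sys_enter TP_PROTO in include/trace/events/syscalls.h would grow an 'int *retcode' out-parameter, and the arch syscall entry path would honor it. A rough sketch from memory (untested, details approximate):

/* include/trace/events/syscalls.h: let a probe veto the syscall */
TRACE_EVENT_FN(sys_enter,
	TP_PROTO(struct pt_regs *regs, long id, int *retcode),
	TP_ARGS(regs, id, retcode),
	/* TP_STRUCT__entry/TP_fast_assign/TP_printk unchanged */
	...
);

/*
 * arch/x86/kernel/ptrace.c:syscall_trace_enter(), with the tracepoint
 * hoisted to run before anything else on the entry path:
 */
int retcode = 0;

if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
	trace_sys_enter(regs, regs->orig_ax, &retcode);
if (retcode < 0)
	do_exit(SIGKILL);
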
> That change would allow the propagation of some sort of policy information. To put it to use, seccomp mode 1 could be implemented on top of trace_syscalls. The following changes would need to happen:
> 1. dummy metadata should be inserted for all unhooked system calls
> 2. perf_trace_buf_submit would need to return an int, or a new TRACE_REG_SECCOMP_REGISTER handler would need to be set up in syscall_enter_register
> 3. if perf is abused, a "kill/fail_on_discard" bit would be added to event->attrs
> 4. perf_trace_buf_submit/perf_tp_event would return 0 for no matches, 'n' for the number of event matches, and -EACCES/? if a 'fail_on_discard' event is seen
> 5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
> 6. trace_sys_enter() would need to be moved to be the first entry in arch/../kernel/ptrace.c for incoming syscalls
> 7. if trace_sys_enter() yields a negative return code, do_exit(SIGKILL) the process and return
>
> Entering seccomp mode 1 would require adding a "0" filter, with the fail_on_discard bit, for every system call number (which is why we need a dummy event call for them, since failing to check the bitmask can't be flagged fail_on_discard). For the three calls that are allowed, a '1' filter would be set.
>
> That would roughly implement seccomp mode 1. It's pretty ugly, and the fact that every disallowed system call has to be blacklisted is not ideal. An alternate model would be to just use the seccomp mode as we do today and let secure_computing() handle the return code of "# of matches". If the # of matches is 0, it terminates. A 'fail_on_discard' bit would then only be good for stopping further tracepoint callback evaluation. This approach would also *almost* nix the need to provide dummy syscall hooks. (Since sigreturn isn't hooked on x86 because it uses ptregs fixup, a dummy would still be needed to apply a "1" filter to.)
>
> Even with that tweak to move to a whitelist model, the perf event evaluation and tracepoint callback ordering is still not guaranteed. Without changing tracepoints themselves, all other TPs will still execute. And for perf events, it'll be first-come-first-served until a fail_on_discard is hit.
>
> After using seccomp mode=1 as the sample case to reduce scope, it's possible to ignore it for now :) and look at the seccomp-filter/mode=2 case. The same mechanism could be used to inject more expressive filters. This would be done over the sys_perf_event_open() interface, assuming the new attr is added to stop perf event list enumeration. Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would be needed (maybe with an ON_EXEC flag option too, to mirror the event->attr on-exec bit). That would yield the ability to register perf events for system calls and then use ioctl() to SET_FILTER on them.
>
> Reducing the privileges of the filters after installation could be done with another attribute bit like 'set_filter_ands'. If that is also set on the event, and a filter is installed to ensure that sys_perf_event_open is blocked, then it should be reasonably sane.
>
> I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow extracting the settings.
>
> Clearly, I haven't written the code for that yet, though making the change for a single platform shouldn't be too much code.
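
To make the userspace half of that concrete, per-task registration would look roughly like the sketch below. perf_event_open() and PERF_EVENT_IOC_SET_FILTER exist today; the enforcement attr bits and a PERF_EVENT_IOC_GET_FILTER counterpart are the new parts, and the tracepoint id still has to be fished out of debugfs by hand:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Attach a syscalls:sys_enter_* tracepoint event to the current task and
 * install a filter string on it.  'tp_id' is the number read from
 * /sys/kernel/debug/tracing/events/syscalls/sys_enter_<name>/id.
 */
static int open_syscall_event(unsigned long long tp_id, const char *filter)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.config = tp_id;

	/* pid == 0: this task; cpu == -1: any cpu; no group leader, no flags */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return -1;

	if (filter && ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

A launcher (or the task itself, before exec) would call that once per allowed syscall and then flip on enforcement via prctl(PR_SET_SECCOMP, 2).
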
> So that leaves me with some questions:
> - Is this the type of reuse that was being encouraged?
> - Does it really make sense to cram this through the perf interface and events? While the changed attributes are innocuous and potentially reusable, it seems that a large amount of the perf facilities would be exposed that could have weird side effects, but I'm sure I still don't fully grok the perf infrastructure.
> - Does extending one tracepoint to provide return values via a pointer make sense? I'd hesitate to make all tracepoint hooks return an int by default.
> - If all tracepoints returned an int, what would the standard value look like, or would it need to be defined per tracepoint implementation?
> - How is ambiguity resolved if multiple perf_events are registered for a syscall with different filters? Maybe a 'stop_on_match'? Though ordering is still a problem then.
> - Is there a way to effect a task-wide change without a seccomp flag (or new task_struct entry) via the existing sys_perf_event_open syscall? I considered suggesting an attr bit called 'secure_computing' such that when an event with the bit is enabled, it automagically forces the task into seccomp enforcement mode, but that, again, seemed hackish.
>
> While I like the idea of sharing the tracepoint infrastructure and the trace_syscalls hooks, as well as using a pre-existing interface with very minor changes, I'm worried that the complexity of the interface and of the infrastructure might undermine the ability to continue meeting the desired security goals. I had originally stayed very close to the seccomp world because of how trivial it is to review the code and verify its accuracy/effectiveness. This approach leaves a lot of gaps for kernel code to seep through and a fair amount of ambiguity in what locked-down syscall filters might look like.
>
> To me, the best part of the above is that it shows that even if we go with a prctl SET_SECCOMP_FILTER-style interface, it is completely certain that if a perf/ftrace-based security infrastructure is in our future, it will be entirely compatible -- even if the prctl() interface ends up being just the "simpler" interface somewhere down the road.
>
> Regardless, I'll hack up a proof of concept based on the outline above. Perhaps something more elegant will emerge once I start tweaking the source, but I'm still seeing too many gaps to be comfortable so far.
>
> [[There is a whole other approach to this too. We could continue with the prctl() interface and mirror the trace_sys_enter model for secure_computing(). Instead of keeping a seccomp_t-based hlist of events, they could be stored in a new hlist for seccomp_events in struct ftrace_event_call. The ftrace filters would be installed there, and the seccomp_syscall_enter() function could do the checks and pass up some state data on the task's seccomp_t that indicates it needs to do_exit(). That would likely reduce the amount of code in seccomp_filter.c pretty seriously (though not entirely beneficially).]]
>
> Thanks again for all the feedback and insights! I really hope we can come to an agreeable approach for implementing kernel attack surface reduction.

Just a quick follow-up. I have a PoC of the perf-based system call filtering code in place. It uses seccomp.mode as a task_context flag, adds require_secure and err_on_discard perf_event_attr bits, and enforces that these new events terminate the process in perf_syscall_enter via a return value from perf_trace_buf_submit.
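
For reference, the two new bits are just more flags in the perf_event_attr bitfield. Sketched below with the names used in the PoC; exact placement and the precise require_secure semantics are whatever the patch linked below actually does:

/* include/linux/perf_event.h: sketch only, existing fields elided */
struct perf_event_attr {
	/* ... type, size, config, sample fields ... */
	__u64	disabled	: 1,	/* existing flag bits ...        */
		/* ... */
		require_secure	: 1,	/* new in the PoC                */
		err_on_discard	: 1;	/* new in the PoC: a filter
					 * discard terminates the task   */
	/* __reserved_1 shrinks accordingly */
	/* ... remaining fields ... */
};
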
This lets a call like:

  LD_LIBRARY_PATH=/host/home/wad/kernel.uml/tests/ /host/home/wad/kernel.uml/tests/perf record \
    -S \
    -e 'syscalls:sys_enter_access' \
    -e 'syscalls:sys_enter_brk' \
    -e 'syscalls:sys_enter_close' \
    -e 'syscalls:sys_enter_exit_group' \
    -e 'syscalls:sys_enter_fcntl64' \
    -e 'syscalls:sys_enter_fstat64' \
    -e 'syscalls:sys_enter_getdents64' \
    -e 'syscalls:sys_enter_getpid' \
    -e 'syscalls:sys_enter_getuid' \
    -e 'syscalls:sys_enter_ioctl' \
    -e 'syscalls:sys_enter_lstat64' \
    -e 'syscalls:sys_enter_mmap_pgoff' \
    -e 'syscalls:sys_enter_mprotect' \
    -e 'syscalls:sys_enter_munmap' \
    -e 'syscalls:sys_enter_open' \
    -e 'syscalls:sys_enter_read' \
    -e 'syscalls:sys_enter_stat64' \
    -e 'syscalls:sys_enter_time' \
    -e 'syscalls:sys_enter_newuname' \
    -e 'syscalls:sys_enter_write' --filter "fd == 1" \
    /bin/ls

run successfully, while omitting a syscall or passing in --filter "fd == 0" properly terminates it (ignoring the need for execve and set_thread_area for PoC purposes).

The change avoids defining a new trace call type or a huge number of internal changes, and it hides seccomp.mode=2 from ABI exposure in prctl, but the attack surface is non-trivial to verify, and I'm not sure if this ABI change makes sense. It amounts to:

 include/linux/ftrace_event.h  |    4 +-
 include/linux/perf_event.h    |   10 +++++---
 kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
 kernel/seccomp.c              |    8 ++++++
 kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
 5 files changed, 82 insertions(+), 16 deletions(-)

And it can be found here:

  http://static.dataspill.org/perf_secure/v1/

If there is any interest at all, I can post it properly to this giant CC list. However, it's unclear to me where this thread stands.

I also see a completely independent path for implementing system call filtering now, but I'll hold off posting that patch until I see where this goes.

Any guidance will be appreciated - thanks again!
will