On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote: > On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote: > >> Do event_* that return non-void exist in the tree at all now? I've >> looked at the various tracepoint macros as well as some of the other >> handlers (trace_function, perf_tp_event, etc) and I'm not seeing any >> places where a return value is honored nor could be. At best, the >> perf_tp_event can be short-circuited it in the hlist_for_each, but >> it'd still need a way to bubble up a failure and result in not calling >> the trace/event that the hook precedes. > > No, none of the current trace hooks have return values. That was what I > was talking about how to implement in my previous emails. Led on by complete ignorance, I think I'm finally beginning to untwist the different pieces of the tracing infrastructure. Unfortunately, that means I took a few wrong turns along the way... I think function tracing looks something like this: ftrace_call has been injected into at a specific callsite. Upon hit: 1. ftrace_call triggers 2. does some checks then calls ftrace_trace_function (via mcount instrs) 3. ftrace_trace_function may be a single func or a list. For a list it will be: ftrace_list_func 4. ftrace_list_func calls each registered hook for that function in a while loop ignoring return values 5. registered hook funcs may then track the call, farm it out to specific sub-handlers, etc. This seems to be a red herring for my use case :/ though this helped me understand your back and forth (Ingo & Steve) regarding dynamic versus explicit events. System call tracing is done via kernel/tracepoint.c events fed in via arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. This yields direct sys_enter and sys_exit event sources (and an event class to hook up per-system call events). This means that ftrace_syscall_enter() does the event prep prior to doing a filter check comparing the ftrace_event object for the given syscall_nr to the event data. perf_sysenter_enter() is similar but it pushes the info over to perf_tp_event to be matched not against the global syscall event entry, but against any sub-event in the linked list on that syscall's event. Is that roughly an accurate representation of the two? I wish I hadn't gotten distracted along the function path, but at least I learned something (and it is relevant to the larger scope of this thread's discussion). After doing that digging, it looks like providing hook call pre-emption and return value propagation will be a unique and fun task for each tracer and event subsystem. If I look solely at tracepoints, a generic change would be to make the trace_##subsys function return an int (which I think was the event_vfs_getname proposal?). The other option is to change the trace_sys_enter proto to include a 'int *retcode'. That change would allow the propagation of some sort of policy information. To put it to use, seccomp mode 1 could be implemented on top of trace_syscalls. The following changes would need to happen: 1. dummy metadata should be inserted for all unhooked system calls 2. perf_trace_buf_submit would need to return an int or a new TRACE_REG_SECCOMP_REGISTER handler would need to be setup in syscall_enter_register. 3. If perf is abused, a "kill/fail_on_discard" bit would be added to event->attrs. 4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches, 'n' for the number of event matches, and -EACCES/? if a 'fail_on_discard' event is seen. 5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode 6. trace_sys_enter() would need to be moved to be the first entry arch/../kernel/ptrace.c for incoming syscalls 7. if trace_sys_enter() yields a negative return code, then do_exit(SIGKILL) the process and return. Entering into seccomp mode 1 would require adding a "0" filter for every system call number (which is why we need a dummy event call for them since failing to check the bitmask can't be flagged fail_on_discard) with the fail_on_discard bit. For the three calls that are allowed, a '1' filter would be set. That would roughly implement seccomp mode 1. It's pretty ugly and the fact that every system call that's disallowed has to be blacklisted is not ideal. An alternate model would be to just use the seccomp mode as we do today and let secure_computing() handle the return code of "# of matches". If it the # of matches is 0, it terminates. A 'fail_on_discard' bit then would only be good to stop further tracepoint callback evaluation. This approach would also *almost* nix the need to provide dummy syscall hooks. (Since sigreturn isn't hooked on x86 because it uses ptregs fixup, a dummy would still be needed to apply a "1" filter to.) Even with that tweak to move to a whitelist model, the perf event evaluation and tracepoint callback ordering is still not guaranteed. Without changing tracepoint itself, all other TPs will still execute. And for perf events, it'll be first-come-first-serve until a fail_on_discard is hit. After using seccomp mode=1 as the sample case to reduce scope, it's possible to ignore it for now :) and look at the seccomp-filter/mode=2 case. The same mechanism could be used to inject more expressive filters. This would be done over the sys_perf_event_open() interface assuming the new attr is added to stop perf event list enumeration. Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would be needed (maybe with the ON_EXEC flag option too to mirror the event->attr on-exec bit). That would yield the ability to register perf events for system calls then use ioctl() to SET_FILTER on them. Reducing the privileges of the filters after installation could be done with another attribute bit like 'set_filter_ands'. If that is also set on the event, and a filter is installed to ensure that sys_perf_event_open is blocked, then it should be reasonably sane. I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow extracting the settings. Clearly, I haven't written the code for that yet, though making the change for a single platform shouldn't be too much code. So that leaves me with some questions: - Is this the type of reuse that was being encouraged? - Does it really make sense to cram this through the perf interface and events? While the changed attributes are innocuous and potentially reusable, it seems that a large amount of the perf facilities are exposed that could have weird side effects, but I'm sure I still don't fully grok the perf infrastructure. - Does extending one tracepoint to provide return values via a pointer make sense? I'd hesitate to make all tracepoint hooks return an int by default. - If all tracepoints returned an int, what would the standard value look like - or would it need to be per tracepoint impl? - How is ambiguity resolved if multiple perf_events are registered for a syscall with different filters? Maybe a 'stop_on_match'? though ordering is still a problem then. - Is there a way to affect a task-wide change without a seccomp flag (or new task_struct entry) via the existing sys_perf_event_open syscall? I considered suggesting a attr bit call 'secure_computing' that when an event with the bit is enabled, it automagically forces the task into seccomp enforcement mode, but that, again, seemed hackish. While I like the idea of sharing the tracepoints infrastructure and the trace_syscalls hooks as well as using a pre-existing interface with very minor changes, I'm worried that the complexity of the interface and of the infrastructure might undermine the ability to continue meeting the desired security goals. I had originally stayed very close to the seccomp world because of how trivial it is to review the code and verify its accuracy/effectiveness. This approach leaves a lot of gaps for kernel code to seep through and a fair amount of ambiguity in what locked down syscall filters might look like. To me, the best part of the above is that it shows that even if we go with a prctl SET_SECCOMP_FILTER-style interface, it is completely certain that if a perf/ftrace-based security infrastructure is on our future, it will be entirely compatible -- even if the prctl() interface is just the "simpler" interface at that point somewhere down the road. Regardless, I'll hack up a proof of concept based on the outline above. Perhaps something more elegant will emerge once I start tweaking the source, but I'm still seeing too many gaps to be comfortable so far. [[There is a whole other approach to this too. We could continue with the prctl() interface and mirror the trace_sys_enter model for secure_computing(). Instead of keeping a seccomp_t-based hlist of events, they could be stored in a new hlist for seccomp_events in struct ftrace_event_call. The ftrace filters would be installed there and the seccomp_syscall_enter() function could do the checks and pass up some state data on the task's seccomp_t that indicates it needs to do_exit(). That would likely reduce the amount of code in seccomp_filter.c pretty seriously (though not entirely beneficially).]] Thanks again for all the feedback and insights! I really hope we can come to an agreeable approach for implementing kernel attack surface reduction. will