Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Wed, Aug 03, 2016 at 11:53:41AM -0700, Kees Cook wrote:
>> > Kees Cook <keescook@xxxxxxxxxxxx> writes:
>> >
>> >> On Tue, Aug 2, 2016 at 1:30 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> >> Let me take this another way instead. What would be a better way to
>> >> provide a mechanism for system owners to disable perf without an LSM?
>> >> (Since far fewer folks run with an enforcing "big" LSM: I'm seeking as
>> >> wide a coverage as possible.)
>> >
>> > I vote for sandboxes. Perhaps seccomp. Perhaps a per userns sysctl.
>> > Perhaps something else.
>>
>> Peter, did you happen to see Eric's solution to this problem for
>> namespaces? Basically, a per-userns sysctl instead of a global sysctl.
>> Is that something that would be acceptable here?
>
> Someone would have to educate me on what a userns is and how that would
> help here.

userns is an abbreviation for user namespace. How it might help is
that it is an easy, inescapable context for processes. Essentially the
idea is to limit the scope of the sysctl to a container.

User namespaces run into flak because, while tremendously simple in
themselves, the code takes advantage of the fact that suid root
executables in a user namespace do not have privileges on anything
outside of the user namespace. Which means that it is semantically
safe to allow operations like creating mount namespaces, mounting
filesystems, creating network namespaces, manipulating the network
stack, etc. All of which allows unprivileged users (that can create
network namespaces) to exercise more kernel code and to exercise the
bugs in that code.

Fundamentally, user namespaces as objects you can create need limits on
the maximum number of user namespaces you can create, to catch runaway
processes. Set the limit you can create to 0 and you get what Kees
wants.
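
To make the limit idea concrete, here is a minimal sketch in Python
rather than kernel C. It is not the patch: the class and method names
are hypothetical, and charging every ancestor on creation is an
assumption about how nested limits would compose. It only shows that a
limit of 0 inside a sandbox denies creation there while the rest of the
system is unaffected.

```python
class LimitExceeded(Exception):
    """Raised when creating a user namespace would exceed a limit."""


class UserNamespace:
    """Toy model of a user namespace with a creation limit (hypothetical)."""

    def __init__(self, parent=None, max_user_namespaces=4):
        self.parent = parent
        # Models a per-namespace max_user_namespaces setting.
        self.max_user_namespaces = max_user_namespaces
        # Number of user namespaces created in or below this one.
        self.count = 0

    def create_child(self):
        """Charge this namespace and every ancestor, then create a child."""
        charged = []
        ns = self
        while ns is not None:
            if ns.count >= ns.max_user_namespaces:
                # Undo the charges already made before failing.
                for done in charged:
                    done.count -= 1
                raise LimitExceeded("max_user_namespaces reached")
            ns.count += 1
            charged.append(ns)
            ns = ns.parent
        # Assumption: the child starts with the creator's limit.
        return UserNamespace(parent=self,
                             max_user_namespaces=self.max_user_namespaces)


if __name__ == "__main__":
    init_ns = UserNamespace(max_user_namespaces=4)
    sandbox = init_ns.create_child()
    sandbox.max_user_namespaces = 0  # disable creation inside the sandbox
    try:
        sandbox.create_child()
    except LimitExceeded as err:
        print("sandbox denied:", err)
    init_ns.create_child()  # still allowed outside the sandbox
```

The rollback on failure keeps the counts consistent when an ancestor,
rather than the creating namespace itself, is the one that is full.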

In my pending patches that were not quite ready for the merge window, I
added a sysctl that described the maximum number of user namespaces
that could be created (default value threads-max), and implemented the
sysctl in a per user way, such that counts and limits were kept for
every user namespace. In a nested user namespace (which is all of them
except for the initial user namespace) the count and limit would be
checked in the current user namespace, then the count would be
incremented in the parent and verified to be below the limit in the
parent user namespace.

What this means in practice is that user namespaces can be enabled by
default on a system, and yet you can easily disable them in a sandbox
that was built with a user namespace.

I named the new sysctls in my patch:

  /proc/sys/userns/max_user_namespaces
  /proc/sys/userns/max_pid_namespaces
  /proc/sys/userns/max_net_namespaces
  /proc/sys/userns/max_uts_namespaces
  /proc/sys/userns/max_ipc_namespaces
  /proc/sys/userns/max_cgroup_namespaces
  /proc/sys/userns/max_mnt_namespaces

What Kees was suggesting was to add a similar sysctl, say:

  /proc/sys/userns/perf_event_enabled

and have the ability to disable perf events in each user namespace,
while still being able to leave usage of perf events enabled by default.

I don't know if any of that is a good fit for perf events. For purposes
of this discussion I assume we are limiting ourselves to discussing
userspace tracing, which semantically is 100% fine for access by
userspace.

Eric