Hi John, On 18 October 2016 at 01:35, John Stultz <john.stultz@xxxxxxxxxx> wrote: > On Mon, Oct 17, 2016 at 3:40 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: >> On Mon, Oct 17, 2016 at 3:35 PM, John Stultz <john.stultz@xxxxxxxxxx> wrote: >>> This patch adds CAP_GROUP_MIGRATE and logic to allows a process >>> to migrate other tasks between cgroups. >>> >>> In Android (where this feature originated), the ActivityManager tracks >>> various application states (TOP_APP, FOREGROUND, BACKGROUND, SYSTEM, >>> etc), and then as applications change states, the SchedPolicy logic >>> will migrate the application tasks between different cgroups used >>> to control the different application states (for example, there is a >>> background cpuset cgroup which can limit background tasks to stay >>> on one low-power cpu, and the bg_non_interactive cpuctrl cgroup can >>> then further limit those background tasks to a small percentage of >>> that one cpu's cpu time). >>> >>> However, for security reasons, Android doesn't want to make the >>> system_server (the process that runs the ActivityManager and >>> SchedPolicy logic), run as root. So in the Android common.git >>> kernel, they have some logic to allow cgroups to loosen their >>> permissions so CAP_SYS_NICE tasks can migrate other tasks between >>> cgroups. >>> >>> The approach taken there overloads CAP_SYS_NICE a bit much, and >>> is maybe more complicated then needed. >>> >>> So this patch, as suggested by Tejun, simply adds a new process >>> capability flag (CAP_CGROUP_MIGRATE), and uses it when checking >>> if a task can migrate other tasks between cgroups. >>> >>> I've tested this with AOSP master (though its a bit hacked in as I >>> still need to properly get the selinux bits aware of the new >>> capability bit) with selinux set to permissive and it seems to be >>> working well. >>> >>> Thoughts and feedback would be appreciated! >>> >>> Cc: Tejun Heo <tj@xxxxxxxxxx> >>> Cc: Li Zefan <lizefan@xxxxxxxxxx> >>> Cc: Jonathan Corbet <corbet@xxxxxxx> >>> Cc: cgroups@xxxxxxxxxxxxxxx >>> Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx> >>> Cc: Rom Lemarchand <romlem@xxxxxxxxxxx> >>> Cc: Colin Cross <ccross@xxxxxxxxxxx> >>> Cc: Dmitry Shmidt <dimitrysh@xxxxxxxxxx> >>> Cc: Ricky Zhou <rickyz@xxxxxxxxxxxx> >>> Cc: Dmitry Torokhov <dmitry.torokhov@xxxxxxxxx> >>> Cc: Todd Kjos <tkjos@xxxxxxxxxx> >>> Cc: Christian Poetzsch <christian.potzsch@xxxxxxxxxx> >>> Cc: Amit Pundir <amit.pundir@xxxxxxxxxx> >>> Cc: Serge E. Hallyn <serge@xxxxxxxxxx> >>> Cc: linux-api@xxxxxxxxxxxxxxx >>> Signed-off-by: John Stultz <john.stultz@xxxxxxxxxx> >>> --- >>> v2: Renamed to just CAP_CGROUP_MIGRATE as reccomended by Tejun >>> --- >>> include/uapi/linux/capability.h | 5 ++++- >>> kernel/cgroup.c | 3 ++- >>> 2 files changed, 6 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h >>> index 49bc062..44d7ff4 100644 >>> --- a/include/uapi/linux/capability.h >>> +++ b/include/uapi/linux/capability.h >>> @@ -349,8 +349,11 @@ struct vfs_cap_data { >>> >>> #define CAP_AUDIT_READ 37 >>> >>> +/* Allow migrating tasks between cgroups */ >>> >>> -#define CAP_LAST_CAP CAP_AUDIT_READ >>> +#define CAP_CGROUP_MIGRATE 38 >>> + >>> +#define CAP_LAST_CAP CAP_CGROUP_MIGRATE >>> >>> #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP) >>> >>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c >>> index 85bc9be..09f84d2 100644 >>> --- a/kernel/cgroup.c >>> +++ b/kernel/cgroup.c >>> @@ -2856,7 +2856,8 @@ static int cgroup_procs_write_permission(struct task_struct *task, >>> */ >>> if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && >>> !uid_eq(cred->euid, tcred->uid) && >>> - !uid_eq(cred->euid, tcred->suid)) >>> + !uid_eq(cred->euid, tcred->suid) && >>> + !ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE)) >>> ret = -EACCES; >> >> This logic seems rather confused to me. Without this patch, a user >> can write to procs if it's root *or* it matches the target uid *or* it >> matches the target suid. How does this make sense? How about >> ptrace_may_access(...) || ns_capable(tcred->user_ns, >> CAP_CGROUP_MIGRATE)? > > Though ptrace_may_access would open it also to apps with > CAP_SYS_PTRACE as well, no? > > Would pulling out from __ptrace_may_access the: > if (uid_eq(caller_uid, tcred->euid) && > uid_eq(caller_uid, tcred->suid) && > uid_eq(caller_uid, tcred->uid) && > gid_eq(caller_gid, tcred->egid) && > gid_eq(caller_gid, tcred->sgid) && > gid_eq(caller_gid, tcred->gid)) > goto ok; > > check and creating a new helper that could be shared between them be > the right approach? So, is creating a new capability here necessarily the right approach? Is this operation so unique, or is there an existing silo (not CAP_SYS_ADMIN) that we can re-use? I ask, because we currently use 38 silos out of a possible 64 capabilities, and when everyone chooses single-use capabilities, we will quickly exhaust the silos. I'm not saying that creating a new capability here is wrong, but it is worth further considering the existing silos to see if there is one that is a suitable match. Looking at http://man7.org/linux/man-pages/man7/capabilities.7.html throws up the following possibilities: CAP_SYS_NICE CAP_SYS_PTRACE CAP_SYS_RESOURCE I'm aware that you said above that use CAP_SYS_NICE overloads that capability a bit too much. Maybe it's true, but on the other hand, by my count from dome rough grepping of the kernel source, there are a total of 14 capable() checks for CAP_SYS_NICE, out of a total of around 1256 capable() checks altogether. So, I think this does need to be balanced against the limited number of silos. Also, CAP_SYS_RESOURCE deserves consideration (34 uses in capable() checks). I'd say, since cgroups are about resources, so there's something of a match there., so it's also worth considering. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html