On Tue, Apr 24, 2018 at 8:40 PM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote: > On 2018-04-24 15:01, Paul Moore wrote: >> On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote: >> > On 2018-04-23 19:15, Paul Moore wrote: >> >> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote: >> >> > On 2018-04-18 19:47, Paul Moore wrote: >> >> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote: >> >> >> > Implement the proc fs write to set the audit container ID of a process, >> >> >> > emitting an AUDIT_CONTAINER record to document the event. >> >> >> > >> >> >> > This is a write from the container orchestrator task to a proc entry of >> >> >> > the form /proc/PID/containerid where PID is the process ID of the newly >> >> >> > created task that is to become the first task in a container, or an >> >> >> > additional task added to a container. >> >> >> > >> >> >> > The write expects up to a u64 value (unset: 18446744073709551615). >> >> >> > >> >> >> > This will produce a record such as this: >> >> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0 >> >> >> > >> >> >> > The "op" field indicates an initial set. The "pid" to "ses" fields are >> >> >> > the orchestrator while the "opid" field is the object's PID, the process >> >> >> > being "contained". Old and new container ID values are given in the >> >> >> > "contid" fields, while res indicates its success. >> >> >> > >> >> >> > It is not permitted to self-set, unset or re-set the container ID. A >> >> >> > child inherits its parent's container ID, but then can be set only once >> >> >> > after. >> >> >> > >> >> >> > See: https://github.com/linux-audit/audit-kernel/issues/32 >> >> >> > >> >> >> > Signed-off-by: Richard Guy Briggs <rgb@xxxxxxxxxx> >> >> >> > --- >> >> >> > fs/proc/base.c | 37 ++++++++++++++++++++ >> >> >> > include/linux/audit.h | 16 +++++++++ >> >> >> > include/linux/init_task.h | 4 ++- >> >> >> > include/linux/sched.h | 1 + >> >> >> > include/uapi/linux/audit.h | 2 ++ >> >> >> > kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> > 6 files changed, 143 insertions(+), 1 deletion(-) >> >> ... >> >> >> >> > /* audit_rule_data supports filter rules with both integer and string >> >> >> > * fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and >> >> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c >> >> >> > index 4e0a4ac..29c8482 100644 >> >> >> > --- a/kernel/auditsc.c >> >> >> > +++ b/kernel/auditsc.c >> >> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid) >> >> >> > return rc; >> >> >> > } >> >> >> > >> >> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid) >> >> >> > +{ >> >> >> > + struct task_struct *parent; >> >> >> > + u64 pcontainerid, ccontainerid; >> >> >> > + >> >> >> > + /* Don't allow to set our own containerid */ >> >> >> > + if (current == task) >> >> >> > + return -EPERM; >> >> >> >> >> >> Why not? Is there some obvious security concern that I missing? >> >> > >> >> > We then lose the distinction in the AUDIT_CONTAINER record between the >> >> > initiating PID and the target PID. This was outlined in the proposal. >> >> >> >> I just went back and reread the v3 proposal and I still don't see a >> >> good explanation of this. Why is this bad? What's the security >> >> concern? >> > >> > I don't remember, specifically. Maybe this has been addressed by the >> > check for children/threads or identical parent container ID. So, I'm >> > reluctantly willing to remove that check for now. >> >> Okay. For the record, if someone can explain to me why this >> restriction saves us from some terrible situation I'm all for leaving >> it. I'm just opposed to restrictions without solid reasoning behind >> them. >> >> >> > Having said that, I'm still not sure we have protected sufficiently from >> >> > a child turning around and setting it's parent's as yet unset or >> >> > inherited audit container ID. >> >> >> >> Yes, I believe we only want to let a task set the audit container for >> >> it's children (or itself/threads if we decide to allow that, see >> >> above). There *has* to be a function to check to see if a task if a >> >> child of a given task ... right? ... although this is likely to be a >> >> pointer traversal and locking nightmare ... hmmm. >> > >> > Isn't that just (struct task_struct)parent == (struct >> > task_struct)child->parent (or ->real_parent)? >> > >> > And now that I say that, it is covered by the following patch's child >> > check, so as long as we keep that, we should be fine. >> >> I was thinking of checking not just current's immediate children, but >> any of it's descendants as I believe that is what we want to limit, >> yes? I just worry that it isn't really practical to perform that >> check. > > The child check I'm talking about prevents setting a task's audit > container ID if it *has* any children or threads, so if it has children > it is automatically disqualified and its grandchildren are irrelevant. > >> >> >> I ask because I suppose it might be possible for some container >> >> >> runtime to do a fork, setup some of the environment and them exec the >> >> >> container (before you answer the obvious "namespaces!" please remember >> >> >> we're not trying to define containers). >> >> > >> >> > I don't think namespaces have any bearing on this concern since none are >> >> > required. >> >> > >> >> >> > + /* Don't allow the containerid to be unset */ >> >> >> > + if (!cid_valid(containerid)) >> >> >> > + return -EINVAL; >> >> >> > + /* if we don't have caps, reject */ >> >> >> > + if (!capable(CAP_AUDIT_CONTROL)) >> >> >> > + return -EPERM; >> >> >> > + /* if containerid is unset, allow */ >> >> >> > + if (!audit_containerid_set(task)) >> >> >> > + return 0; >> >> >> > + /* it is already set, and not inherited from the parent, reject */ >> >> >> > + ccontainerid = audit_get_containerid(task); >> >> >> > + rcu_read_lock(); >> >> >> > + parent = rcu_dereference(task->real_parent); >> >> >> > + rcu_read_unlock(); >> >> >> > + task_lock(parent); >> >> >> > + pcontainerid = audit_get_containerid(parent); >> >> >> > + task_unlock(parent); >> >> >> > + if (ccontainerid != pcontainerid) >> >> >> > + return -EPERM; >> >> >> > + return 0; >> >> I'm looking at the parent checks again and I wonder if the logic above >> is what we really want. Maybe it is, but I'm not sure. >> >> Things I'm wondering about: >> >> * "ccontainerid" and "containerid" are too close in name, I kept >> confusing myself when looking at this code. Please change one. Bonus >> points if it is shorter. > > Would c_containerid and p_containerid be ok? child_cid and parent_cid? Either would be an improvement over ccontainerid/containerid. I would give a slight node to child_cid/parent_cid just for length reasons. > I'd really like it to have the same root as the parameter handed in so > teh code is easier to follow. It would be nice to have that across > caller to local, but that's challenging. That's fine, but you have to admit that ccontainerid/containerid is awkward and not easy to quickly differentiate :) > I've been tempted to use contid or even cid everywhere instead of > containerid. Perhaps the longer name doesn't bother me because I > like its uniqueness and I learned touch-typing in grade 9 and I like > 100+ character wide terminals? ;-) I would definitely appreciate contid/cid or similar, but I don't care too much either way. As far as terminal width is concerned, please make sure your code fits in 80 char terminals. >> * What if the orchestrator wants to move the task to a new container? >> Right now it looks like you can only do that once, then then the >> task's audit container ID will no longer be the same as real_parent > > A task's audit container ID can be unset or inherited, and then set > only once. After that, if you want it moved to a new container you > can't and your only option is to spawn another peer to that task or a > child of it and set that new task's audit container ID. Okay. We've had some many discussions about this both on and off list that I lose track on where we stand for certain things. I think preventing task movement is fine for the initial effort so long as we don't prevent adding it in the future; I don't see anything (other than the permission checks under discussion, which is fine) preventing this. > Currently, the method of detecting if its audit container ID has been > set (rather than inherited) was to check its parent's audit container > ID. Yeah ... those are two different things. I've been wondering if we should introduce a set/inherited flag as simply checking the parent task's audit container ID isn't quite the same; although it may be "close enough" that it doesn't matter in practice. However, I'm beginning to think this parent/child relationship isn't really important beyond the inheritance issue ... more on this below. > The only reason to change this might be if the audit container ID > were not inheritable, but then we lose the accountability of a task > spawning another process and being able to leave its child's audit > container ID unset and unaccountable to any existing container. I think > the relationship to the parent is crucial, and if something wants to > change audit container ID it can, by spawning childrent and leaving a > trail of container IDs in its parent processes. (So what if a parent > dies?) The audit container ID *must* be inherited, I don't really think anyone is questioning that. What I'm wondering about is what we accomplish by comparing the child's and parent's audit container ID? I've thought about this a bit more and I think we are making this way too complicated right now. We basically have three rules for the audit container ID which we need to follow: 1. Children inherit their parent's audit container ID; this includes the magic "unset" audit container ID. 2. You can't change the audit container ID once set. 3. In order to set the audit container ID of a process you must have CAP_AUDIT_CONTROL. With that in mind, I think the permission checks would be something like this: [SIDE NOTE: Audit Container ID in acronym form works out to "acid" ;) ] int perm(task, acid) { if (!task || !valid(acid)) return -EINVAL; if (!capable(CAP_AUDIT_CONTROL)) return -EPERM; if (task->acid != UNSET) return -EPERM; return 0; } >> ... or does the orchestrator change that? *Can* the orchestrator >> change real_parent (I suspect the answer is "no")? > > I don't think the orchestrator is able to change real_parent. I didn't think so either, but I didn't do an exhaustive check. > I've forgotten why there is a ->parent and ->real_parent and how they can > change. One is for the wait signal. I don't remember the purpose of > the other. I know ptrace makes use of real_parent when re-parenting the process being ptrace'd. > If the parent dies before the child, the child will be re-parented on > its grandparent if the parent doesn't hang around zombified, if I > understand correctly. If anything, a parent dying would likely further > restrict the ability to set a task's audit container ID because a parent > with an identical ID could vanish. All the more reason to go with the simplified approach above. I think the parent/child relationship is a bit of a distraction and a complexity that isn't important (except for the inheritance of course). >> * I think the key is the relationship between current and task, not >> between task and task->real_parent. I believe what we really care >> about is that task is a descendant of current. We might also want to >> allow current to change the audit container ID if it holds >> CAP_AUDIT_CONTROL, regardless of it's relationship with task. > > Currently, a process with CAP_AUDIT_CONTROL can set the audit container > ID of any task that hasn't got children or threads, isn't itself, and > its audit container ID is inherited or unset. This was to try to > prevent games with parents and children scratching each other's backs. > > I would feel more comfortable if only descendants were settable, so > adding that restriction sounds like a good idea to me other than the > tree-climbing excercise and overhead involved. > >> >> >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid, >> >> >> > + u64 containerid, int rc) >> >> >> > +{ >> >> >> > + struct audit_buffer *ab; >> >> >> > + uid_t uid; >> >> >> > + struct tty_struct *tty; >> >> >> > + >> >> >> > + if (!audit_enabled) >> >> >> > + return; >> >> >> > + >> >> >> > + ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER); >> >> >> > + if (!ab) >> >> >> > + return; >> >> >> > + >> >> >> > + uid = from_kuid(&init_user_ns, task_uid(current)); >> >> >> > + tty = audit_get_tty(current); >> >> >> > + >> >> >> > + audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid); >> >> >> > + audit_log_task_context(ab); >> >> >> > + audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d", >> >> >> > + from_kuid(&init_user_ns, audit_get_loginuid(current)), >> >> >> > + tty ? tty_name(tty) : "(none)", audit_get_sessionid(current), >> >> >> > + task_tgid_nr(task), oldcontainerid, containerid, !rc); >> >> >> > + >> >> >> > + audit_put_tty(tty); >> >> >> > + audit_log_end(ab); >> >> >> > +} >> >> >> > + >> >> >> > +/** >> >> >> > + * audit_set_containerid - set current task's audit_context containerid >> >> >> > + * @containerid: containerid value >> >> >> > + * >> >> >> > + * Returns 0 on success, -EPERM on permission failure. >> >> >> > + * >> >> >> > + * Called (set) from fs/proc/base.c::proc_containerid_write(). >> >> >> > + */ >> >> >> > +int audit_set_containerid(struct task_struct *task, u64 containerid) >> >> >> > +{ >> >> >> > + u64 oldcontainerid; >> >> >> > + int rc; >> >> >> > + >> >> >> > + oldcontainerid = audit_get_containerid(task); >> >> >> > + >> >> >> > + rc = audit_set_containerid_perm(task, containerid); >> >> >> > + if (!rc) { >> >> >> > + task_lock(task); >> >> >> > + task->containerid = containerid; >> >> >> > + task_unlock(task); >> >> >> > + } >> >> >> > + >> >> >> > + audit_log_set_containerid(task, oldcontainerid, containerid, rc); >> >> >> > + return rc; >> >> >> >> >> >> Why are audit_set_containerid_perm() and audit_log_containerid() >> >> >> separate functions? >> >> > >> >> > (I assume you mean audit_log_set_containerid()?) >> >> >> >> Yep. My fingers got tired typing in that function name and decided a >> >> shortcut was necessary. >> >> >> >> > It seemed clearer that all the permission checking was in one function >> >> > and its return code could be used to report the outcome when logging the >> >> > (attempted) action. This is the same structure as audit_set_loginuid() >> >> > and it made sense. >> >> >> >> When possible I really like it when the permission checks are in the >> >> same function as the code which does the work; it's less likely to get >> >> abused that way (you have to willfully bypass the access checks). The >> >> exceptions might be if you wanted to reuse the access control code, or >> >> insert a modular access mechanism (e.g. LSMs). >> > >> > I don't follow how it could be abused. The return code from the perm >> > check gates setting the value and is used in the success field in the >> > log. >> >> If the permission checks are in the same function body as the code >> which does the work you have to either split the function, or rewrite >> it, if you want to bypass the permission checks. It may be more of a >> style issue than an actual safety issue, but the comments about >> single-use functions in the same scope is the tie breaker. > > Perhaps I'm just being quite dense, but I just don't follow what the > problem is and how you suggest fixing it. A bunch of gotos to a label > such as "out:" to log the refused action? That seems messy and > unstructured. Fold audit_set_containerid_perm() and audit_log_set_containerid() into their only caller, audit_set_containerid(). -- paul moore www.paul-moore.com _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers