On Fri, 2018-02-02 at 16:24 -0500, Paul Moore wrote:
> On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote:
> > On 2018-01-09 11:18, Simo Sorce wrote:
> > > On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote:
> > > > Containers are a userspace concept. The kernel knows nothing of them.
> > > >
> > > > The Linux audit system needs a way to be able to track the container
> > > > provenance of events and actions. Audit needs the kernel's help to do
> > > > this.
> > > >
> > > > Since the concept of a container is entirely a userspace concept, a
> > > > registration from the userspace container orchestration system initiates
> > > > this. This will define a point in time and a set of resources
> > > > associated with a particular container with an audit container
> > > > identifier.
> > > >
> > > > The registration is a u64 representing the audit container identifier
> > > > written to a special file in a pseudo filesystem (proc, since PID tree
> > > > already exists) representing a process that will become a parent process
> > > > in that container. This write might place restrictions on mount
> > > > namespaces required to define a container, or at least careful checking
> > > > of namespaces in the kernel to verify permissions of the orchestrator so
> > > > it can't change its own container ID. A bind mount of nsfs may be
> > > > necessary in the container orchestrator's mount namespace. This write
> > > > can only happen once per process.
> > > >
> > > > Note: The justification for using a u64 is that it minimizes the
> > > > information printed in every audit record, reducing bandwidth and limits
> > > > comparisons to a single u64 which will be faster and less error-prone.
> > > >
> > > > Require CAP_AUDIT_CONTROL to be able to carry out the registration.
> > > > At that time, record the target container's user-supplied audit
> > > > container identifier along with a target container's parent process
> > > > (which may become the target container's "init" process) process ID
> > > > (referenced from the initial PID namespace) in a new record
> > > > AUDIT_CONTAINER with a qualifying op=$action field.
> > > >
> > > > Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
> > > > container ID present on an auditable action or event.
> > > >
> > > > Forked and cloned processes inherit their parent's audit container
> > > > identifier, referenced in the process' task_struct. Since the audit
> > > > container identifier is inherited rather than written, it can still be
> > > > written once. This will prevent tampering while allowing nesting.
> > > > (This can be implemented with an internal settable flag upon
> > > > registration that does not get copied across a fork/clone.)
> > > >
> > > > Mimic setns(2) and return an error if the process has already initiated
> > > > threading or forked since this registration should happen before the
> > > > process execution is started by the orchestrator and hence should not
> > > > yet have any threads or children. If this is deemed overly restrictive,
> > > > switch all of the target's threads and children to the new containerID.
> > > >
> > > > Trust the orchestrator to judiciously use and restrict CAP_AUDIT_CONTROL.
> > > >
> > > > When a container ceases to exist because the last process in that
> > > > container has exited log the fact to balance the registration action.
> > > > (This is likely needed for certification accountability.)
> > > >
> > > > At this point it appears unnecessary to add a container session
> > > > identifier since this is all tracked from loginuid and sessionid to
> > > > communicate with the container orchestrator to spawn an additional
> > > > session into an existing container which would be logged.
> > > > It can be added at a later date without breaking API should it be
> > > > deemed necessary.
> > > >
> > > > The following namespace logging actions are not needed for certification
> > > > purposes at this point, but are helpful for tracking namespace activity.
> > > > These are auxiliary records that are associated with namespace
> > > > manipulation syscalls unshare(2), clone(2) and setns(2), so the records
> > > > will only show up if explicit syscall rules have been added to document
> > > > this activity.
> > > >
> > > > Log the creation of every namespace, inheriting/adding its spawning
> > > > process' audit container identifier(s), if applicable. Include the
> > > > spawning and spawned namespace IDs (device and inode number tuples).
> > > > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> > > > Note: At this point it appears only network namespaces may need to track
> > > > container IDs apart from processes since incoming packets may cause an
> > > > auditable event before being associated with a process. Since a
> > > > namespace can be shared by processes in different containers, the
> > > > namespace will need to track all containers to which it has been
> > > > assigned.
> > > >
> > > > Upon registration, the target process' namespace IDs (in the form of a
> > > > nsfs device number and inode number tuple) will be recorded in an
> > > > AUDIT_NS_INFO auxiliary record.
> > > >
> > > > Log the destruction of every namespace that is no longer used by any
> > > > process, including the namespace IDs (device and inode number tuples).
> > > > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> > > >
> > > > Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> > > > the parent and child namespace IDs for any changes to a process'
> > > > namespaces.
> > > > [setns(2)]
> > > > Note: It may be possible to combine AUDIT_NS_* record formats and
> > > > distinguish them with an op=$action field depending on the fields
> > > > required for each message type.
> > > >
> > > > The audit container identifier will need to be reaped from all
> > > > implicated namespaces upon the destruction of a container.
> > > >
> > > > This namespace information adds supporting information for tracking
> > > > events not attributable to specific processes.
> > > >
> > > > Changelog:
> > > >
> > > > (Upstream V3)
> > > > - switch back to u64 (from pmoore, can be expanded to u128 in future if
> > > >   need arises without breaking API. u32 was originally proposed, up to
> > > >   c36 discussed)
> > > > - write-once, but children inherit audit container identifier and can
> > > >   then still be written once
> > > > - switch to CAP_AUDIT_CONTROL
> > > > - group namespace actions together, auxiliary records to namespace
> > > >   operations.
> > > >
> > > > (Upstream V2)
> > > > - switch from u64 to u128 UUID
> > > > - switch from "signal" and "trigger" to "register"
> > > > - restrict registration to single process or force all threads and
> > > >   children into same container
> > >
> > > I am trying to understand the back and forth on the ID size.
> > >
> > > From an orchestrator POV anything that requires tracking a node
> > > specific ID is not ideal.
> > >
> > > Orchestrators tend to span many nodes, and containers tend to have IDs
> > > that are either UUID or have a Hash (like SHA256) as identifier.
> > >
> > > The problem here is two-fold:
> > >
> > > a) Your auditing requires some mapping to be useful outside of the
> > > system.
> > > If you aggregate audit logs outside of the system or you want to
> > > correlate the system audit logs with other components dealing with
> > > containers, now you need a place where you provide a mapping from your
> > > audit u64 to the ID a container has in the rest of the system.
> > >
> > > b) Now you need a mapping of some sort. The simplest way a container
> > > orchestrator can go about this is to just use the UUID or Hash
> > > representing their view of the container, truncate it to a u64 and use
> > > that for Audit. This means there are some chances there will be a
> > > collision and a duplicate u64 ID will be used by the orchestrator as
> > > the container ID. What happens in that case?
> >
> > Paul, can you justify this somewhat larger inconvenience for some
> > relatively minor convenience on our part?
>
> Done in direct response to Simo.

Sorry, but your response sounds more like waving my concerns away than
addressing them, the excuse being: we can't please everyone, so we are
going to please no one.

> But to be clear Richard, we've talked about this a few times, it's not
> a "minor convenience" on our part, it's a pretty big convenience once
> we start having to route audit events and make decisions based on
> the audit container ID information. Audit performance is less than
> awesome now, I'm working hard to not make it worse.

Sounds like a security vs performance trade-off to me.

> > u64 vs u128 is easy for us to
> > accommodate in terms of scalar comparisons. It doubles the information
> > in every container id field we print in audit records.
>
> ... and slows down audit container ID checks.

Are you saying a cmp on a u128 is slower than a comparison on a u64 and
that this is something that will be noticeable?

> > A c36 is a bigger step.
>
> Yeah, we're not doing that, no way.

Ok, I can see your point though I do not agree with it.

I can see why you do not want to have arbitrary length strings, but a
u128 sounded like a reasonable compromise to me as it has enough room
to be able to have unique cluster-wide IDs, which a u64 definitely makes
a lot harder to provide w/o tight coordination.

Simo.

-- 
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc
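
A rough illustration of the collision concern in point (b) above, as a
minimal sketch rather than anything proposed in this thread. It assumes
the orchestrator derives the audit container ID by taking the first 8
bytes of a SHA-256 digest of its own container identifier, and it uses
the standard birthday approximation to estimate the chance that at least
two of n such IDs collide. The function names and the truncation scheme
are hypothetical.

```python
import hashlib
import math


def truncate_to_u64(container_id: str) -> int:
    """Hypothetical mapping: first 8 bytes of SHA-256(container_id) as a u64."""
    digest = hashlib.sha256(container_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


def collision_probability(n_ids: int, bits: int = 64) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes IDs are uniformly distributed over a space of 2^bits values.
    """
    space = 2.0 ** bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (10**6, 10**9, 2**32):
        p64 = collision_probability(n, bits=64)
        p128 = collision_probability(n, bits=128)
        print(f"n={n}: u64 p={p64:.3e}, u128 p={p128:.3e}")
```

Under these assumptions a 64-bit ID space reaches roughly a 39% chance
of at least one collision at 2^32 IDs, while a 128-bit space stays
negligible at that scale.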