Re: [PATCH 2/9] Implement containers as kernel objects

Paul Moore <paul@xxxxxxxxxxxxxx> · Wed, 16 Aug 2017 18:21:49 -0400

On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote:
> Hi David,
>
> I wanted to respond to this thread to attempt some constructive feedback,
> better late than never.  I had a look at your fsopen/fsmount() patchset(s) to
> support this patchset which was interesting, but doesn't directly affect my
> work.  The primary patch of interest to the audit kernel folks (Paul Moore and
> me) is this patch while the rest of the patchset is interesting, but not likely
> to directly affect us.  This patch has most of what we need to solve our
> problem.
>
> Paul and I agree that audit is going to have a difficult time identifying
> containers or even namespaces without some change to the kernel.  The audit
> subsystem in the kernel needs at least a basic clue about which container
> caused an event to be able to report this at the appropriate level and ignore
> it at other levels to avoid a DoS.

While there is some increased risk of "death by audit", this is really
only an issue once we start supporting multiple audit daemons; simply
associating auditable events with the container that triggered them
shouldn't add any additional overhead (I hope).  For a number of use
cases, a single auditd running outside the containers, but recording
all their events with some type of container attribution will be
sufficient.  This is step #1.

However, we will obviously want to go a bit further and support
multiple audit daemons on the system to allow containers to
record/process their own events (side note: the non-container auditd
instance will still see all the events).  There are a number of ways
we could tackle this, both via in-kernel and in-userspace record
routing, each with their own pros/cons.  However, how this works is
going to be dependent on how we identify containers and track their
audit events: the bits from step #1.  For this reason I'm not really
interested in worrying about the multiple auditd problem just yet;
it's obviously important, and something to keep in mind while working
up a solution, but it isn't something we should focus on right now.

> We also agree that there will need to be some sort of trigger from userspace to
> indicate the creation of a container and its allocated resources and we're not
> really picky how that is done, such as a clone flag, a syscall or a sysfs write
> (or even a read, I suppose), but there will need to be some permission
> restrictions, obviously.  (I'd like to see capabilities used for this by adding
> a specific container bit to the capabilities bitmask.)

To be clear, from an audit perspective I think the only thing we would
really care about controlling access to is the creation and assignment
of a new audit container ID/token, not necessarily the container
itself.  It's a small point, but an important one I think.

> I doubt we will be able to accomodate all definitions or concepts of a
> container in a timely fashion.  We'll need to start somewhere with a minimum
> definition so that we can get traction and actually move forward before another
> compelling shared kernel microservice method leaves our entire community
> behind.  I'd like to declare that a container is a full set of cloned
> namespaces, but this is inefficient, overly constricting and unnecessary for
> our needs.  If we could agree on a minimum definition of a container (which may
> have only one specific cloned namespace) then we have something on which to
> build.  I could even see a container being defined by a trigger sent from
> userspace about a process (task) from which all its children are considered to
> be within that container, subject to further nesting.

I really would prefer if we could avoid defining the term "container".
Even if we manage to get it right at this particular moment, we will
surely be made fools a year or two from now when things change.  At
the very least lets avoid a rigid definition of container, I'll
concede that we will probably need to have some definition simply so
we can implement something, I just don't want the design or
implementation to depend on a particular definition.

This comment is jumping ahead a bit, but from an audit perspective I
think we handle this by emitting an audit record whenever a container
ID is created which describes it as the kernel sees it; as of now that
probably means a list of namespace IDs.  Richard mentions this in his
email, I just wanted to make it clear that I think we should see this
as a flexible mechanism.  At the very least we will likely see a few
more namespaces before the world moves on from containers.

> In the simplest usable model for audit, if a container (definition implies and)
> starts a PID namespace, then the container ID could simply be the container's
> "init" process PID in the initial PID namespace.  This assumes that as soon as
> that process vanishes, that entire container and all its children are killed
> off (which you've done).  There may be some container orchestration systems
> that don't use a unique PID namespace per container and that imposing this will
> cause them challenges.

I don't follow how this would cause challenges if the containers do
not use a unique PID namespace; you are suggesting using the PID from
in the context of the initial PID namespace, yes?

Regardless, I do worry that using a PID could potentially be a bit
racy once we start jumping between kernel and userspace (audit
configuration, logs, etc.).

> If containers have at minimum a unique mount namespace then the root path
> dentry inode device and inode number could be used, but there are likely better
> identifiers.  Again, there may be container orchestrators that don't use a
> unique mount namespace per container and that imposing this will cause
> challenges.
>
> I expect there are similar examples for each of the other namespaces.

The PID case is a bit unique as each process is going to have a unique
PID regardless of namespaces, but even that has some drawbacks as
discussed above.  As for the other namespaces, I agree that we can't
rely on them (see my earlier comments).

> If we could pick one namespace type for consensus for which each container has
> a unique instance of that namespace, we could use the dev/ino tuple from that
> namespace as had originally been suggested by Aristeu Rozanski more than 4
> years ago as part of the set of namespace IDs.  I had also attempted to
> solve this problem by using the namespace' proc inode, then switched over to
> generate a unique kernel serial number for each namespace and then went back to
> namespace proc dev/ino once Al Viro implemented nsfs:
>         v1      https://lkml.org/lkml/2014/4/22/662
>         v2      https://lkml.org/lkml/2014/5/9/637
>         v3      https://lkml.org/lkml/2014/5/20/287
>         v4      https://lkml.org/lkml/2014/8/20/844
>         v5      https://lkml.org/lkml/2014/10/6/25
>         v6      https://lkml.org/lkml/2015/4/17/48
>         v7      https://lkml.org/lkml/2015/5/12/773
>
> These patches don't use a container ID, but track all namespaces in use for an
> event.  This has the benefit of punting this tracking to userspace for some
> other tool to analyse and determine to which container an event belongs.
> This will use a lot of bandwidth in audit log files when a single
> container ID that doesn't require nesting information to be complete
> would be a much more efficient use of audit log bandwidth.

Relying on a particular namespace to identify a containers is a
non-starter from my perspective for all the reasons previously
discussed.

> If we rely only on the setting of arbitrary container names from userspace,
> then we must provide a map or tree back to the initial audit domain for that
> running kernel to be able to differentiate between potentially identical
> container names assigned in a nested container system.  If we assign a
> container serial number sequentially (atomic64_inc) from the kernel on request
> from userspace like the sessionID and log the creation with all nsIDs and the
> parent container serial number and/or container name, the nesting is clear due
> to lack of ambiguity in potential duplicate names in nesting.  If a container
> serial number is used, the tree of inheritance of nested containers can be
> rebuilt from the audit records showing what containers were spawned from what
> parent.

I believe we are going to need a container ID to container definition
(namespace, etc.) mapping mechanism regardless of if the container ID
is provided by userspace or a kernel generated serial number.  This
mapping should be recorded in the audit log when the container ID is
created/defined.

> As was suggested in one of the previous threads, if there are any events not
> associated with a task (incoming network packets) we log the namespace ID and
> then only concern ourselves with its container serial number or container name
> once it becomes associated with a task at which point that tracking will be
> more important anyways.

Agreed.  After all, a single namespace can be shared between multiple
containers.  For those security officers who need to track individual
events like this they will have the container ID mapping information
in the logs as well so they should be able to trace the unassociated
event to a set of containers.

> I'm not convinced that a userspace or kernel generated UUID is that useful
> since they are large, not human readable and may not be globally unique given
> the "pets vs cattle" direction we are going with potentially identical
> conditions in hosts or containers spawning containers, but I see no need to
> restrict them.

>From a kernel perspective I think an int should suffice; after all,
you can't have more containers then you have processes.  If the
container engine requires something more complex, it can use the int
as input to its own mapping function.

> How do we deal with setns()?  Once it is determined that action is permitted,
> given the new combinaiton of namespaces and potential membership in a different
> container, record the transition from one container to another including all
> namespaces if the latter are a different subset than the target container
> initial set.

That is a fun one, isn't it?  I think this is where the container
ID-to-definition mapping comes into play.  If setns() changes the
process such that the existing container ID is no longer valid then we
need to do a new lookup in the table to see if another container ID is
valid; if no established container ID mappings are valid, the
container ID becomes "undefined".

-- 
paul moore
www.paul-moore.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html