Re: [PATCH 2/9] Implement containers as kernel objects

Paul Moore <paul@xxxxxxxxxxxxxx> · Fri, 8 Sep 2017 16:02:25 -0400

On Fri, Aug 18, 2017 at 4:03 AM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote:
> On 2017-08-16 18:21, Paul Moore wrote:
>> On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <rgb@xxxxxxxxxx> wrote:
>> > Hi David,
>> >
>> > I wanted to respond to this thread to attempt some constructive feedback,
>> > better late than never.  I had a look at your fsopen/fsmount() patchset(s) to
>> > support this patchset which was interesting, but doesn't directly affect my
>> > work.  The primary patch of interest to the audit kernel folks (Paul Moore and
>> > me) is this patch while the rest of the patchset is interesting, but not likely
>> > to directly affect us.  This patch has most of what we need to solve our
>> > problem.
>> >
>> > Paul and I agree that audit is going to have a difficult time identifying
>> > containers or even namespaces without some change to the kernel.  The audit
>> > subsystem in the kernel needs at least a basic clue about which container
>> > caused an event to be able to report this at the appropriate level and ignore
>> > it at other levels to avoid a DoS.
>>
>> While there is some increased risk of "death by audit", this is really
>> only an issue once we start supporting multiple audit daemons; simply
>> associating auditable events with the container that triggered them
>> shouldn't add any additional overhead (I hope).  For a number of use
>> cases, a single auditd running outside the containers, but recording
>> all their events with some type of container attribution will be
>> sufficient.  This is step #1.
>>
>> However, we will obviously want to go a bit further and support
>> multiple audit daemons on the system to allow containers to
>> record/process their own events (side note: the non-container auditd
>> instance will still see all the events).  There are a number of ways
>> we could tackle this, both via in-kernel and in-userspace record
>> routing, each with their own pros/cons.  However, how this works is
>> going to be dependent on how we identify containers and track their
>> audit events: the bits from step #1.  For this reason I'm not really
>> interested in worrying about the multiple auditd problem just yet;
>> it's obviously important, and something to keep in mind while working
>> up a solution, but it isn't something we should focus on right now.
>>
>> > We also agree that there will need to be some sort of trigger from userspace to
>> > indicate the creation of a container and its allocated resources and we're not
>> > really picky how that is done, such as a clone flag, a syscall or a sysfs write
>> > (or even a read, I suppose), but there will need to be some permission
>> > restrictions, obviously.  (I'd like to see capabilities used for this by adding
>> > a specific container bit to the capabilities bitmask.)
>>
>> To be clear, from an audit perspective I think the only thing we would
>> really care about controlling access to is the creation and assignment
>> of a new audit container ID/token, not necessarily the container
>> itself.  It's a small point, but an important one I think.
>>
>> > I doubt we will be able to accomodate all definitions or concepts of a
>> > container in a timely fashion.  We'll need to start somewhere with a minimum
>> > definition so that we can get traction and actually move forward before another
>> > compelling shared kernel microservice method leaves our entire community
>> > behind.  I'd like to declare that a container is a full set of cloned
>> > namespaces, but this is inefficient, overly constricting and unnecessary for
>> > our needs.  If we could agree on a minimum definition of a container (which may
>> > have only one specific cloned namespace) then we have something on which to
>> > build.  I could even see a container being defined by a trigger sent from
>> > userspace about a process (task) from which all its children are considered to
>> > be within that container, subject to further nesting.
>>
>> I really would prefer if we could avoid defining the term "container".
>> Even if we manage to get it right at this particular moment, we will
>> surely be made fools a year or two from now when things change.  At
>> the very least lets avoid a rigid definition of container, I'll
>> concede that we will probably need to have some definition simply so
>> we can implement something, I just don't want the design or
>> implementation to depend on a particular definition.
>>
>> This comment is jumping ahead a bit, but from an audit perspective I
>> think we handle this by emitting an audit record whenever a container
>> ID is created which describes it as the kernel sees it; as of now that
>> probably means a list of namespace IDs.  Richard mentions this in his
>> email, I just wanted to make it clear that I think we should see this
>> as a flexible mechanism.  At the very least we will likely see a few
>> more namespaces before the world moves on from containers.
>>
>> > In the simplest usable model for audit, if a container (definition implies and)
>> > starts a PID namespace, then the container ID could simply be the container's
>> > "init" process PID in the initial PID namespace.  This assumes that as soon as
>> > that process vanishes, that entire container and all its children are killed
>> > off (which you've done).  There may be some container orchestration systems
>> > that don't use a unique PID namespace per container and that imposing this will
>> > cause them challenges.
>>
>> I don't follow how this would cause challenges if the containers do
>> not use a unique PID namespace; you are suggesting using the PID from
>> in the context of the initial PID namespace, yes?
>
> The PID of the "init" process of a container (PID=1 inside container,
> but PID=containerID from the initial PID namespace perspective).

Yep.  I still don't see how a container not creating a unique PID
namespace presents a challenge here as the unique information would be
taken from the initial PID namespace.

However, based on some off-list discussions I expect this is going to
be a non-issue in the next proposal.

>> Regardless, I do worry that using a PID could potentially be a bit
>> racy once we start jumping between kernel and userspace (audit
>> configuration, logs, etc.).
>
> How do you think this could be racy?  An event happenning before or as
> the container has been defined?

It's racy for the same reasons why we have the pid struct in the
kernel.  If the orchestrator is referencing things via a PID there is
always some danger of a mixup.

>> > If containers have at minimum a unique mount namespace then the root path
>> > dentry inode device and inode number could be used, but there are likely better
>> > identifiers.  Again, there may be container orchestrators that don't use a
>> > unique mount namespace per container and that imposing this will cause
>> > challenges.
>> >
>> > I expect there are similar examples for each of the other namespaces.
>>
>> The PID case is a bit unique as each process is going to have a unique
>> PID regardless of namespaces, but even that has some drawbacks as
>> discussed above.  As for the other namespaces, I agree that we can't
>> rely on them (see my earlier comments).
>
> (In general can you specify which earlier comments so we can be sure to
> what you are referring?)

Really?  How about the race condition concerns.  Come on Richard ...

>> > If we could pick one namespace type for consensus for which each container has
>> > a unique instance of that namespace, we could use the dev/ino tuple from that
>> > namespace as had originally been suggested by Aristeu Rozanski more than 4
>> > years ago as part of the set of namespace IDs.  I had also attempted to
>> > solve this problem by using the namespace' proc inode, then switched over to
>> > generate a unique kernel serial number for each namespace and then went back to
>> > namespace proc dev/ino once Al Viro implemented nsfs:
>> >         v1      https://lkml.org/lkml/2014/4/22/662
>> >         v2      https://lkml.org/lkml/2014/5/9/637
>> >         v3      https://lkml.org/lkml/2014/5/20/287
>> >         v4      https://lkml.org/lkml/2014/8/20/844
>> >         v5      https://lkml.org/lkml/2014/10/6/25
>> >         v6      https://lkml.org/lkml/2015/4/17/48
>> >         v7      https://lkml.org/lkml/2015/5/12/773
>> >
>> > These patches don't use a container ID, but track all namespaces in use for an
>> > event.  This has the benefit of punting this tracking to userspace for some
>> > other tool to analyse and determine to which container an event belongs.
>> > This will use a lot of bandwidth in audit log files when a single
>> > container ID that doesn't require nesting information to be complete
>> > would be a much more efficient use of audit log bandwidth.
>>
>> Relying on a particular namespace to identify a containers is a
>> non-starter from my perspective for all the reasons previously
>> discussed.
>
> I'd rather not either and suspect there isn't much danger of it, but if
> it is determined that there is one namespace in particular that is a
> minimum requirement, I'd prefer to use that nsID instead of creating an
> additional ID.
>
>> > If we rely only on the setting of arbitrary container names from userspace,
>> > then we must provide a map or tree back to the initial audit domain for that
>> > running kernel to be able to differentiate between potentially identical
>> > container names assigned in a nested container system.  If we assign a
>> > container serial number sequentially (atomic64_inc) from the kernel on request
>> > from userspace like the sessionID and log the creation with all nsIDs and the
>> > parent container serial number and/or container name, the nesting is clear due
>> > to lack of ambiguity in potential duplicate names in nesting.  If a container
>> > serial number is used, the tree of inheritance of nested containers can be
>> > rebuilt from the audit records showing what containers were spawned from what
>> > parent.
>>
>> I believe we are going to need a container ID to container definition
>> (namespace, etc.) mapping mechanism regardless of if the container ID
>> is provided by userspace or a kernel generated serial number.  This
>> mapping should be recorded in the audit log when the container ID is
>> created/defined.
>
> Agreed.
>
>> > As was suggested in one of the previous threads, if there are any events not
>> > associated with a task (incoming network packets) we log the namespace ID and
>> > then only concern ourselves with its container serial number or container name
>> > once it becomes associated with a task at which point that tracking will be
>> > more important anyways.
>>
>> Agreed.  After all, a single namespace can be shared between multiple
>> containers.  For those security officers who need to track individual
>> events like this they will have the container ID mapping information
>> in the logs as well so they should be able to trace the unassociated
>> event to a set of containers.
>>
>> > I'm not convinced that a userspace or kernel generated UUID is that useful
>> > since they are large, not human readable and may not be globally unique given
>> > the "pets vs cattle" direction we are going with potentially identical
>> > conditions in hosts or containers spawning containers, but I see no need to
>> > restrict them.
>>
>> From a kernel perspective I think an int should suffice; after all,
>> you can't have more containers then you have processes.  If the
>> container engine requires something more complex, it can use the int
>> as input to its own mapping function.
>
> PIDs roll over.  That already causes some ambiguity in reporting.  If a
> system is constantly spawning and reaping containers, especially
> single-process containers, I don't want to have to worry about that ID
> rolling to keep track of it even though there should be audit records of
> the spawn and death of each container.  There isn't significant cost
> added here compared with some of the other overhead we're dealing with.

Fine, make it a u64.  I believe that's what I've been proposing in the
off-list discussion if memory serves.

A UUID or string are not acceptable from my perspective.  Too big for
the audit records and not really necessary anyway, a u64 should be
just fine.

... and if anyone dares bring up that 640kb quote I swear I'll NACK
all their patches for the next year :)

>> > How do we deal with setns()?  Once it is determined that action is permitted,
>> > given the new combinaiton of namespaces and potential membership in a different
>> > container, record the transition from one container to another including all
>> > namespaces if the latter are a different subset than the target container
>> > initial set.
>>
>> That is a fun one, isn't it?  I think this is where the container
>> ID-to-definition mapping comes into play.  If setns() changes the
>> process such that the existing container ID is no longer valid then we
>> need to do a new lookup in the table to see if another container ID is
>> valid; if no established container ID mappings are valid, the
>> container ID becomes "undefined".
>
> Hopefully we can design this stuff so that container IDs are still valid
> while that transition occurs.
>
>> paul moore
>
> - RGB
>
> --
> Richard Guy Briggs <rgb@xxxxxxxxxx>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635

-- 
paul moore
www.paul-moore.com
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html