[RFC][PATCH 0/9] Make containers kernel objects

David Howells <dhowells@xxxxxxxxxx> · Mon, 22 May 2017 17:22:26 +0100

Here is a set of patches to define a container object for the kernel and
to provide some methods to create and manipulate such objects.

The reason I think this is necessary is that the kernel has no idea how to
direct upcalls to what userspace considers to be a container - current
Linux practice appears to make a "container" just an arbitrarily chosen
junction of namespaces, control groups and files, which may be changed
individually within the "container".

The kernel upcall mechanism then needs to decide in which set of
namespaces, etc., it must exec the appropriate upcall program.  Examples of
this include:

 (1) The DNS resolver.  The DNS cache in the kernel should probably be
     per-network namespace, but in userspace the program, its libraries and
     its config data are associated with a mount tree and a user namespace
     and it gets run in a particular pid namespace.

 (2) NFS ID mapper.  The NFS ID mapping cache should also probably be
     per-network namespace.

 (3) nfsdcltrack.  A way for NFSD to access stable storage for tracking
     of persistent state.  Again, network-namespace dependent, but also
     perhaps mount-namespace dependent.

 (4) General request-key upcalls.  Not particularly namespace dependent,
     apart from keyrings being somewhat governed by the user namespace and
     the upcall being configured by the mount namespace.

These patches are built on top of the mount context patchset so that
namespaces can be properly propagated over submounts/automounts.

These patches implement a container object that holds the following things:

 (1) Namespaces.

 (2) A root directory.

 (3) A set of processes, including a designated 'init' process.

 (4) The creator's credentials, including ownership.

 (5) A place to hang security for the container, allowing policies to be
     set per-container.

I also want to add:

 (6) Control groups.

 (7) A per-container keyring that can be added to from outside of the
     container, even once the container is live, for the provision of
     filesystem authentication/encryption keys in advance of the container
     being started.
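
To make the two lists above concrete, here is a rough sketch of how such an
object might be laid out.  This is not the struct from the patches: the
name container_sketch, the field names, the field types and the name length
are all assumptions made for illustration, and the kernel types are only
forward-declared so that the fragment compiles on its own.

```c
#include <stddef.h>

/* Opaque stand-ins for kernel types; only pointers to them are used. */
struct nsproxy;
struct path;
struct task_struct;
struct cred;
struct cgroup;
struct key;

/* Hypothetical layout for illustration; not the struct from the patches. */
struct container_sketch {
	char			name[32];	/* name given at creation (length assumed) */
	struct nsproxy		*ns;		/* (1) namespaces */
	struct path		*root;		/* (2) root directory; NULL until mounted */
	struct task_struct	*init;		/* (3) designated 'init' process */
	struct task_struct	**members;	/* (3) member processes */
	size_t			nr_members;	/* (3) number of entries in members */
	const struct cred	*cred;		/* (4) creator's credentials */
	void			*security;	/* (5) per-container security blob */
	struct cgroup		*cgroup;	/* (6) wanted, not yet implemented */
	struct key		*keyring;	/* (7) wanted, not yet implemented */
};
```

A real implementation would most likely use kernel list heads and reference
counting rather than a bare pointer array; the sketch only shows which
pieces of state end up hanging off one object.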

You can get a list of containers by examining /proc/containers - but I'm
not sure how much value this gets you.  Note that the container in which
you are running is called "<current>" and you can only see other containers
that were started from within yours.  Containers are therefore effectively
hierarchical and an init_container is set up when the system boots.

Some management operations are provided:

 (1) int fd = container_create(const char *name, unsigned int flags);

     Create a container of the given name and return a handle to it as a
     file descriptor.  flags indicates which namespaces should be inherited
     from the caller and which should be replaced with new ones.  It is
     possible to set up a container with a null root filesystem that can be
     mounted later.

 (2) int fsfd = fsopen(const char *fsname, int container_fd,
		       unsigned int flags);

     Prepare a mount context inside the container.  This uses all the
     container's namespaces instead of the caller's.

 (3) fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
	     unsigned int flags);

     Mount a prepared superblock.  dfd can be given container_fd to use the
     container to which it refers as the root of the pathwalk.

     If path is "/" and at_flags is AT_FSMOUNT_CONTAINER_ROOT, then this
     will attempt to mount the root of the container and create a mount
     namespace for it.  The container must've been created with
     CONTAINER_NEW_EMPTY_FS_NS.

 (4) pid_t pid = fork_into_container(int container_fd);

     Create the init process in a container.  The process uses that
     container's namespaces instead of the caller's.

 (5) int sfd = container_socket(int container_fd,
				int domain, int type, int protocol);

     Create a socket inside a container.  The socket gets the container's
     namespaces.  This allows netlink operations to be called within that
     container to set it up from outside (at least in theory).

 (6) mkdirat(int dfd, ...);
     mknodat(int dfd, ...);
     openat(int dfd, ...);

     Supplying a container fd as dfd makes the pathwalk happen relative to
     the root of the container.  Note that the path must be *relative*.

And some need to be/could be added:

 (7) Directly set a container's namespaces to allow cross-container
     sharing.

 (8) Adjust the control group membership of a container.

 (9) Add a key inside a container keyring.

(10) Kill/suspend/freeze/reboot container, both from inside and out.

(11) Set container's root dir.

(12) Set the container's security policy.

(13) Allow overlayfs to access filesystems outside of the container in
     which it is being created.
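
Operations (1) to (4) are meant to be used in sequence: create the
container, prepare a mount context in it, mount its root, then fork its
init process.  The sketch below shows only that ordering and the error
handling around it.  The sketch_*() wrappers are stand-ins that fail with
ENOSYS, because the syscalls are not in a mainline kernel; the two flag
values are placeholders rather than the real uapi values; and it is an
assumption that fork_into_container() returns 0 in the child the way fork()
does.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Placeholder values; the real ones live in the patch set's uapi headers. */
#define SKETCH_CONTAINER_NEW_EMPTY_FS_NS	0x1u
#define SKETCH_AT_FSMOUNT_CONTAINER_ROOT	0x1u

/* Stand-ins for the proposed syscalls; every one fails with ENOSYS. */
static int sketch_container_create(const char *name, unsigned int flags)
{
	(void)name; (void)flags;
	errno = ENOSYS;
	return -1;
}

static int sketch_fsopen(const char *fsname, int cfd, unsigned int flags)
{
	(void)fsname; (void)cfd; (void)flags;
	errno = ENOSYS;
	return -1;
}

static int sketch_fsmount(int fsfd, int dfd, const char *path,
			  unsigned int at_flags, unsigned int flags)
{
	(void)fsfd; (void)dfd; (void)path; (void)at_flags; (void)flags;
	errno = ENOSYS;
	return -1;
}

static pid_t sketch_fork_into_container(int cfd)
{
	(void)cfd;
	errno = ENOSYS;
	return -1;
}

/*
 * Returns the pid of the container's init and stores the container fd in
 * *cfd_out; returns -1 with errno set (and *cfd_out == -1) on failure.
 */
pid_t sketch_start_container(const char *name, const char *fsname,
			     int *cfd_out)
{
	int cfd, fsfd, saved;
	pid_t pid;

	*cfd_out = -1;

	/* (1) Container with an empty mount namespace, root not yet set. */
	cfd = sketch_container_create(name, SKETCH_CONTAINER_NEW_EMPTY_FS_NS);
	if (cfd < 0)
		return -1;

	/* (2) Mount context that uses the container's namespaces. */
	fsfd = sketch_fsopen(fsname, cfd, 0);
	if (fsfd < 0)
		goto err_cfd;

	/* Source/option setup on fsfd is part of the mount context
	 * patchset and is left out here. */

	/* (3) Mount the prepared superblock as the container's root. */
	if (sketch_fsmount(fsfd, cfd, "/",
			   SKETCH_AT_FSMOUNT_CONTAINER_ROOT, 0) < 0) {
		saved = errno;
		close(fsfd);
		errno = saved;
		goto err_cfd;
	}
	close(fsfd);

	/* (4) Create init; the child runs in the container's namespaces. */
	pid = sketch_fork_into_container(cfd);
	if (pid < 0)
		goto err_cfd;
	if (pid == 0)
		_exit(0);	/* a real init would exec here */

	*cfd_out = cfd;
	return pid;

err_cfd:
	saved = errno;
	close(cfd);
	errno = saved;
	return -1;
}
```

On a kernel without the patches this fails at step (1) with ENOSYS, which
is what the stubs model.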

Kernel upcalls are invoked in the root of the container that incurs them
rather than in the init namespace context.  There's still some awkwardness
here if you, say, share a network namespace between containers.  Either the
upcall binaries and configuration must be duplicated between sharing
containers or a container must be elected as the one in which such upcalls
will be done.

Some further thoughts:

 (*) Should there be an AT_IN_CONTAINER flag to provide to syscalls that
     take a container in lieu of AT_FDCWD or a directory fd?  The problem
     is that syscalls such as mkdirat() and openat() don't have an at_flags
     argument.

 (*) Should there be a container hierarchy at all?  It seems that this is
     only really necessary for /proc/containers.  Do we want to allow
     containers-within-containers?

 (*) Should each container automatically have its own pid namespace such
     that its 'init' process always appears as pid 1?

 (*) Does this allow kernel upcalls to be accounted against the correct
     control group?

 (*) Should each container have a 'list' of accessible device numbers such
     that certain device files can be made usable within a container?  And
     can devtmpfs/udev be made to show the correct file set for each
     container?

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=container

Note that this is dependent on the mount-context branch.

David
---
David Howells (9):
      containers: Rename linux/container.h to linux/container_dev.h
      Implement containers as kernel objects
      Provide /proc/containers
      Allow processes to be forked and upcalled into a container
      Open a socket inside a container
      Allow fs syscall dfd arguments to take a container fd
      Make fsopen() able to initiate mounting into a container
      Honour CONTAINER_NEW_EMPTY_FS_NS
      Sample program for driving container objects

 arch/x86/entry/syscalls/syscall_32.tbl |    3 
 arch/x86/entry/syscalls/syscall_64.tbl |    3 
 drivers/acpi/container.c               |    2 
 drivers/base/container.c               |    2 
 fs/fsopen.c                            |   33 +-
 fs/libfs.c                             |    3 
 fs/namei.c                             |   52 ++-
 fs/namespace.c                         |  108 +++++-
 fs/nfs/namespace.c                     |    2 
 fs/nfs/nfs4namespace.c                 |    4 
 fs/proc/root.c                         |   13 +
 fs/sb_config.c                         |   29 +-
 include/linux/container.h              |   91 ++++-
 include/linux/container_dev.h          |   25 +
 include/linux/cred.h                   |    3 
 include/linux/init_task.h              |    4 
 include/linux/kmod.h                   |    1 
 include/linux/lsm_hooks.h              |   25 +
 include/linux/mount.h                  |    5 
 include/linux/nsproxy.h                |    7 
 include/linux/pid.h                    |    5 
 include/linux/proc_ns.h                |    3 
 include/linux/sb_config.h              |    5 
 include/linux/sched.h                  |    3 
 include/linux/sched/task.h             |    4 
 include/linux/security.h               |   20 +
 include/linux/syscalls.h               |    6 
 include/uapi/linux/container.h         |   28 ++
 include/uapi/linux/fcntl.h             |    2 
 include/uapi/linux/magic.h             |    1 
 init/Kconfig                           |    7 
 init/main.c                            |    4 
 kernel/Makefile                        |    2 
 kernel/container.c                     |  576 ++++++++++++++++++++++++++++++++
 kernel/cred.c                          |   45 ++-
 kernel/exit.c                          |    1 
 kernel/fork.c                          |  117 ++++++-
 kernel/kmod.c                          |   13 +
 kernel/kthread.c                       |    3 
 kernel/namespaces.h                    |   15 +
 kernel/nsproxy.c                       |   34 +-
 kernel/pid.c                           |    4 
 kernel/sys_ni.c                        |    5 
 net/socket.c                           |   37 ++
 samples/containers/test-container.c    |  162 +++++++++
 security/security.c                    |   18 +
 security/selinux/hooks.c               |    5 
 47 files changed, 1408 insertions(+), 132 deletions(-)
 create mode 100644 include/linux/container_dev.h
 create mode 100644 include/uapi/linux/container.h
 create mode 100644 kernel/container.c
 create mode 100644 kernel/namespaces.h
 create mode 100644 samples/containers/test-container.c
