Here is a set of patches to define a container object for the kernel and
to provide some methods to create and manipulate such objects.

The reason I think this is necessary is that the kernel has no idea how to
direct upcalls to what userspace considers to be a container - current
Linux practice appears to make a "container" just an arbitrarily chosen
junction of namespaces, control groups and files, which may be changed
individually within the "container".

The kernel upcall mechanism then needs to decide in which set of
namespaces, etc., it must exec the appropriate upcall program.  Examples
of this include:

 (1) The DNS resolver.  The DNS cache in the kernel should probably be
     per-network namespace, but in userspace the program, its libraries
     and its config data are associated with a mount tree and a user
     namespace and it gets run in a particular pid namespace.

 (2) NFS ID mapper.  The NFS ID mapping cache should also probably be
     per-network namespace.

 (3) nfsdcltrack.  A way for NFSD to access stable storage for tracking
     of persistent state.  Again, network-namespace dependent, but also
     perhaps mount-namespace dependent.

 (4) General request-key upcalls.  Not particularly namespace dependent,
     apart from keyrings being somewhat governed by the user namespace
     and the upcall being configured by the mount namespace.

These patches are built on top of the mount context patchset so that
namespaces can be properly propagated over submounts/automounts.

These patches implement a container object that holds the following
things:

 (1) Namespaces.

 (2) A root directory.

 (3) A set of processes, including a designated 'init' process.

 (4) The creator's credentials, including ownership.

 (5) A place to hang security for the container, allowing policies to be
     set per-container.

I also want to add:

 (6) Control groups.
 (7) A per-container keyring that can be added to from outside of the
     container, even once the container is live, for the provision of
     filesystem authentication/encryption keys in advance of the
     container being started.

You can get a list of containers by examining /proc/containers - but I'm
not sure how much value this gets you.  Note that the container in which
you are running is called "<current>" and you can only see other
containers that were started from within yours.  Containers are therefore
effectively hierarchical and an init_container is set up when the system
boots.

Some management operations are provided:

 (1) int fd = container_create(const char *name, unsigned int flags);

     Create a container of the given name and return a handle to it as a
     file descriptor.  flags indicates which namespaces should be
     inherited from the caller and which should be replaced with new
     ones.  It is possible to set up a container with a null root
     filesystem that can be mounted later.

 (2) int fsfd = fsopen(const char *fsname, int container_fd,
                       unsigned int flags);

     Prepare a mount context inside the container.  This uses all the
     container's namespaces instead of the caller's.

 (3) fsmount(int fsfd, int dfd, const char *path,
             unsigned int at_flags, unsigned int flags);

     Mount a prepared superblock.  dfd can be given container_fd to use
     the container to which it refers as the root of the pathwalk.

     If path is "/" and at_flags is AT_FSMOUNT_CONTAINER_ROOT, then this
     will attempt to mount the root of the container and create a mount
     namespace for it.  The container must have been created with
     CONTAINER_NEW_EMPTY_FS_NS.

 (4) pid_t pid = fork_into_container(int container_fd);

     Create the init process in a container.  The process uses that
     container's namespaces instead of the caller's.

 (5) int sfd = container_socket(int container_fd,
                                int domain, int type, int protocol);

     Create a socket inside a container.  The socket gets the container's
     namespaces.
     This allows netlink operations to be called within that container to
     set it up from outside (at least in theory).

 (6) mkdirat(int dfd, ...);
     mknodat(int dfd, ...);
     openat(int dfd, ...);

     Supplying a container fd as dfd makes the pathwalk happen relative
     to the root of the container.  Note that the path must be
     *relative*.

And some need to be/could be added:

 (7) Directly set a container's namespaces to allow cross-container
     sharing.

 (8) Adjust the control group membership of a container.

 (9) Add a key inside a container keyring.

 (10) Kill/suspend/freeze/reboot container, both from inside and out.

 (11) Set container's root dir.

 (12) Set the container's security policy.

 (13) Allow overlayfs to access filesystems outside of the container in
      which it is being created.

Kernel upcalls are invoked in the root of the container that incurs them
rather than in the init namespace context.  There's still some
awkwardness here if you, say, share a network namespace between
containers.  Either the upcall binaries and configuration must be
duplicated between sharing containers or a container must be elected as
the one in which such upcalls will be done.

Some further thoughts:

 (*) Should there be an AT_IN_CONTAINER flag to provide to syscalls that
     take a container in lieu of AT_FDCWD or a directory fd?  The problem
     is that calls such as mkdirat() and openat() don't have an at_flags
     argument.

 (*) Should there be a container hierarchy at all?  It seems that this is
     only really necessary for /proc/containers.  Do we want to allow
     containers-within-containers?

 (*) Should each container automatically have its own pid namespace such
     that its 'init' process always appears as pid 1?

 (*) Does this allow kernel upcalls to be accounted against the correct
     control group?

 (*) Should each container have a 'list' of accessible device numbers
     such that certain device files can be made usable within a
     container?  And can devtmpfs/udev be made to show the correct file
     set for each container?
The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=container

Note that this is dependent on the mount-context branch.

David
---
David Howells (9):
      containers: Rename linux/container.h to linux/container_dev.h
      Implement containers as kernel objects
      Provide /proc/containers
      Allow processes to be forked and upcalled into a container
      Open a socket inside a container
      Allow fs syscall dfd arguments to take a container fd
      Make fsopen() able to initiate mounting into a container
      Honour CONTAINER_NEW_EMPTY_FS_NS
      Sample program for driving container objects


 arch/x86/entry/syscalls/syscall_32.tbl |    3
 arch/x86/entry/syscalls/syscall_64.tbl |    3
 drivers/acpi/container.c               |    2
 drivers/base/container.c               |    2
 fs/fsopen.c                            |   33 +-
 fs/libfs.c                             |    3
 fs/namei.c                             |   52 ++-
 fs/namespace.c                         |  108 +++++-
 fs/nfs/namespace.c                     |    2
 fs/nfs/nfs4namespace.c                 |    4
 fs/proc/root.c                         |   13 +
 fs/sb_config.c                         |   29 +-
 include/linux/container.h              |   91 ++++-
 include/linux/container_dev.h          |   25 +
 include/linux/cred.h                   |    3
 include/linux/init_task.h              |    4
 include/linux/kmod.h                   |    1
 include/linux/lsm_hooks.h              |   25 +
 include/linux/mount.h                  |    5
 include/linux/nsproxy.h                |    7
 include/linux/pid.h                    |    5
 include/linux/proc_ns.h                |    3
 include/linux/sb_config.h              |    5
 include/linux/sched.h                  |    3
 include/linux/sched/task.h             |    4
 include/linux/security.h               |   20 +
 include/linux/syscalls.h               |    6
 include/uapi/linux/container.h         |   28 ++
 include/uapi/linux/fcntl.h             |    2
 include/uapi/linux/magic.h             |    1
 init/Kconfig                           |    7
 init/main.c                            |    4
 kernel/Makefile                        |    2
 kernel/container.c                     |  576 ++++++++++++++++++++++++++++++++
 kernel/cred.c                          |   45 ++-
 kernel/exit.c                          |    1
 kernel/fork.c                          |  117 ++++++-
 kernel/kmod.c                          |   13 +
 kernel/kthread.c                       |    3
 kernel/namespaces.h                    |   15 +
 kernel/nsproxy.c                       |   34 +-
 kernel/pid.c                           |    4
 kernel/sys_ni.c                        |    5
 net/socket.c                           |   37 ++
 samples/containers/test-container.c    |  162 +++++++++
 security/security.c                    |   18 +
 security/selinux/hooks.c               |    5
 47 files changed, 1408 insertions(+), 132 deletions(-)
 create mode 100644 include/linux/container_dev.h
 create mode 100644 include/uapi/linux/container.h
 create mode 100644 kernel/container.c
 create mode 100644 kernel/namespaces.h
 create mode 100644 samples/containers/test-container.c