An early prototype to demonstrate the CPU namespace interface and its
mechanism.

The kernel provides two ways to control CPU resources for tasks:
1. cgroup cpuset:
   A control mechanism to restrict CPUs to a task or a set of tasks
   attached to that group
2. syscall sched_setaffinity:
   A system call that can pin tasks to a set of CPUs

The kernel also provides three ways to view the CPU resources available
to the system:
1. sys/procfs:
   CPU system information is exposed through sysfs and procfs: the
   online, offline and present CPUs as well as the load characteristics
   of the CPUs
2. syscall sched_getaffinity:
   A system call interface to get the cpuset affinity of tasks
3. cgroup cpuset:
   While cgroup is more of a control mechanism than a display mechanism,
   it can be viewed to retrieve the CPU restrictions applied on a group
   of tasks

Coherency of information
------------------------
The control and display interfaces are fairly disjoint from each other.
Restrictions can be set through control interfaces like cgroups, while
many applications, legacy or otherwise, get their view of the system
through sysfs/procfs and size resources such as the number of
threads/processes and memory allocations based on that information.

This can lead to unexpected runtime behavior and can have a high impact
on performance.

Existing solutions to the problem include userspace tools like LXCFS,
which can fake the sysfs information by mounting over the sysfs online
file so that it is coherent with the limits set through cgroup cpuset.
However, LXCFS is an external solution and needs to be explicitly set up
for applications that require it. Another concern is that tools like
LXCFS do not handle all the other display mechanisms, such as procfs
load stats.

Therefore, there is a case to be made for a clean interface.
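The incoherence described above can be observed from userspace. The
following is a minimal sketch, not part of the patchset: the helper names
parse_cpu_list() and cpu_views() are made up for illustration, and the
parser assumes the plain "0-3,8" cpulist format. It compares what the
display interface (the sysfs online file) reports with what the control
interface (the task affinity mask) allows.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel cpulist string such as "0-3,8" into a set of CPU IDs.

    Assumption: only plain IDs and "first-last" ranges are present.
    """
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def cpu_views(online_path="/sys/devices/system/cpu/online"):
    """Collect the CPU counts seen through display and control interfaces."""
    try:
        with open(online_path) as f:
            sysfs_online = len(parse_cpu_list(f.read()))
    except OSError:
        sysfs_online = None  # sysfs file not readable in this environment
    return {
        # Display view: CPUs the system reports as online.
        "sysfs_online": sysfs_online,
        # What many runtimes use to size thread/process pools.
        "os_cpu_count": os.cpu_count(),
        # Control view: CPUs this task is actually allowed to run on.
        "affinity": len(os.sched_getaffinity(0)),
    }


if __name__ == "__main__":
    for name, count in cpu_views().items():
        print(f"{name:>12}: {count}")
```

Run inside a container restricted with a cpuset, the affinity count is
expected to follow the restriction while the sysfs count still describes
the whole machine, which is the mismatch that applications sizing
themselves from sysfs/procfs run into.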
Security and fair use implications
----------------------------------
In a multi-tenant system, multiple containers may exist, and exposing
information about the entire system, rather than just the resources that
a container is restricted to, can have security and fair use
implications such as:
1. An actor that is aware of the CPU node topology can schedule
   workloads and select CPUs such that the bus is flooded, causing a
   denial-of-service attack
2. Identifying the CPU system topology can help identify cores that are
   close to buses and peripherals such as GPUs, to get an undue latency
   advantage over the rest of the workloads

A survey RFD that discusses other potential solutions and their concerns
can be found here: https://lkml.org/lkml/2021/7/22/204

This prototype patchset introduces a new kernel namespace mechanism --
CPU namespace.

The CPU namespace isolates CPU information by virtualizing logical CPU
IDs and creating a scrambled virtual CPU map of the same. It latches
onto the task_struct, and the CPU translations are designed to be in a
flat hierarchy: every virtual namespace CPU maps to a physical CPU at
the creation of the namespace. The advantage of a flat hierarchy is that
translations are O(1) and children do not need to traverse up the tree
to retrieve a translation.

This namespace then allows both control and display interfaces to be CPU
namespace context aware, such that a task within a namespace only gets
the view, and therefore the control, of the CPU resources available to
it via a virtual CPU map.
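The flat, scrambled translation described above can be modelled in a few
lines of userspace code. This is only a toy sketch under stated
assumptions, not the kernel implementation: the class and method names
are hypothetical, virtual IDs are assumed to be dense (0..n-1), and the
scrambling is a simple shuffle; the patchset may assign IDs differently.

```python
import random


class CpuNamespace:
    """Toy model of a flat virtual-to-physical CPU map (hypothetical names)."""

    def __init__(self, cpus, parent=None, rng=None):
        if rng is None:
            rng = random.Random()
        if parent is not None:
            # The IDs passed in are the parent's virtual IDs. Resolve them
            # to physical IDs once, at creation, so that later lookups
            # never have to walk up the namespace tree.
            cpus = [parent.to_physical(c) for c in cpus]
        physical = sorted(set(cpus))
        rng.shuffle(physical)  # scramble the order to obscure the topology
        self.v2p = dict(enumerate(physical))             # virtual -> physical
        self.p2v = {p: v for v, p in self.v2p.items()}   # physical -> virtual

    def to_physical(self, vcpu):
        """O(1) lookup used by control paths (e.g. setting affinity)."""
        return self.v2p[vcpu]

    def to_virtual(self, pcpu):
        """O(1) lookup used by display paths (e.g. sysfs/procfs views)."""
        return self.p2v[pcpu]

    def translate_mask(self, vcpus):
        """Translate a set of virtual CPU IDs to physical CPU IDs."""
        return {self.v2p[v] for v in vcpus}


if __name__ == "__main__":
    root = CpuNamespace(range(8))
    child = CpuNamespace([0, 1, 2, 3], parent=root)
    print("child virtual -> physical:", child.v2p)
```

Because the child stores physical IDs directly, a translation in a
nested namespace costs the same single lookup as in its parent.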
Experiment
----------
We designed an experiment to benchmark nginx configured with
"worker_processes auto" (which ensures that the number of processes to
spawn is derived from the resources viewed on the system) and a
benchmark/driver application, wrk.

Nginx: Nginx is a web server that can also be used as a reverse proxy,
load balancer, mail proxy and HTTP cache
Wrk: wrk is a modern HTTP benchmarking tool capable of generating
significant load when run on a single multi-core CPU

Docker is used as the containerization platform of choice.

The numbers were gathered on an IBM Power 9 CPU @ 2.979GHz with 176 CPUs
and 127GB memory
kernel: 5.14

Case1: vanilla kernel - cpuset 4 cpus, no optimization
Case2: CPU namespace kernel - cpuset 4 cpus

+-----------------------+----------+----------+-----------------+
|         Metric        |  Case1   |  Case2   | Case2 vs Case1  |
+-----------------------+----------+----------+-----------------+
| PIDs                  |      177 |        5 |  172 fewer PIDs |
| mem usage (init) (MB) |    272.8 |    11.12 |          95.92% |
| mem usage (peak) (MB) |    281.3 |    20.62 |          92.66% |
| Latency (avg ms)      |    70.91 |    25.36 |          64.23% |
| Requests/sec          | 47011.05 | 47080.98 |           0.14% |
| Transfer/sec (MB)     |    38.11 |    38.16 |           0.13% |
+-----------------------+----------+----------+-----------------+

With the CPU namespace we see the correct number of PIDs spawning,
corresponding to the cpuset limits set. The memory utilization drops by
92-95%, the latency reduces by 64%, and the throughput, in requests and
transfer per second, is essentially unchanged.

Note: To utilize this new namespace in a container runtime like Docker,
the CPU namespace clone flag was modified to coincide with the PID
namespace flag, as PID namespaces are a building block of containers and
will always be invoked.

Current shortcomings in the prototype
-------------------------------------
1. Containers also frequently use CFS period and quota to restrict CPU
   runtime, also known as millicores in modern container runtimes.
   The RFC interface currently does not account for this in the scheme
   of things.
2. While /proc/stat is now namespace aware and userspace programs like
   top will see the CPU utilization for their view of virtual CPUs, if
   the system or any other application outside the namespace bumps up
   the CPU utilization, it will still show up in sys/user time. This
   should ideally be shown as stolen time instead.
   The current implementation plugs into the display of stats rather
   than the accounting, which causes incorrect reporting of stolen time.
3. The current implementation assumes that no hotplug operations occur
   within a container, and hence the online and present CPUs within a
   CPU namespace are always the same and query the same CPU namespace
   mask
4. As this is a proof of concept, we currently do not differentiate
   between cgroup cpus_allowed and effective_cpus and plug them into the
   same virtual CPU map of the namespace
5. As described in a fair use implication earlier, knowledge of the CPU
   topology can potentially be misused to flood the bus.
   While scrambling the CPU set in the namespace can help by obfuscating
   information, the topology can still be roughly figured out by using
   IPI latencies to determine siblings or far away cores

More information about the design and a video demo of the prototype can
be found here: https://pratiksampat.github.io/cpu_namespace.html

Pratik R. Sampat (5):
  ns: Introduce CPU Namespace
  ns: Add scrambling functionality to CPU namespace
  cpuset/cpuns: Make cgroup CPUset CPU namespace aware
  cpu/cpuns: Make sysfs CPU namespace aware
  proc/cpuns: Make procfs load stats CPU namespace aware

 drivers/base/cpu.c             |  35 ++++-
 fs/proc/namespaces.c           |   4 +
 fs/proc/stat.c                 |  50 +++++--
 include/linux/cpu_namespace.h  | 159 ++++++++++++++++++++++
 include/linux/nsproxy.h        |   2 +
 include/linux/proc_ns.h        |   2 +
 include/linux/user_namespace.h |   1 +
 include/uapi/linux/sched.h     |   1 +
 init/Kconfig                   |   8 ++
 kernel/Makefile                |   1 +
 kernel/cgroup/cpuset.c         |  57 +++++++-
 kernel/cpu_namespace.c         | 233 +++++++++++++++++++++++++++++++++
 kernel/fork.c                  |   2 +-
 kernel/nsproxy.c               |  30 ++++-
 kernel/sched/core.c            |  16 ++-
 kernel/ucount.c                |   1 +
 16 files changed, 581 insertions(+), 21 deletions(-)
 create mode 100644 include/linux/cpu_namespace.h
 create mode 100644 kernel/cpu_namespace.c

--
2.31.1