On Mon, Dec 25, 2023 at 11:12 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > On Mon, Dec 25, 2023 at 3:44 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote: > > > > On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote: > > > > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > > > > > > > Background > > > > > ========== > > > > > > > > > > In our containerized environment, we've identified unexpected OOM events > > > > > where the OOM-killer terminates tasks despite having ample free memory. > > > > > This anomaly is traced back to tasks within a container using mbind(2) to > > > > > bind memory to a specific NUMA node. When the allocated memory on this node > > > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score, > > > > > indiscriminately kills tasks. > > > > > > > > > > The Challenge > > > > > ============= > > > > > > > > > > In a containerized environment, independent memory binding by a user can > > > > > lead to unexpected system issues or disrupt tasks being run by other users > > > > > on the same server. If a user genuinely requires memory binding, we will > > > > > allocate dedicated servers to them by leveraging kubelet deployment. > > > > > > > > > > Currently, users possess the ability to autonomously bind their memory to > > > > > specific nodes without explicit agreement or authorization from our end. > > > > > It's imperative that we establish a method to prevent this behavior. > > > > > > > > > > Proposed Solution > > > > > ================= > > > > > > > > > > - Capability > > > > > Currently, any task can perform MPOL_BIND without specific capabilities. > > > > > Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this > > > > > may have unintended consequences. Capabilities, being broad, might grant > > > > > unnecessary privileges. We should explore alternatives to prevent > > > > > unexpected side effects. > > > > > > > > > > - LSM > > > > > Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2) > > > > > to disable MPOL_BIND. This approach is more flexibility and allows for > > > > > fine-grained control without unintended consequences. A sample LSM BPF > > > > > program is included, demonstrating practical implementation in a > > > > > production environment. > > > > > > > > > > - seccomp > > > > > seccomp is relatively heavyweight, making it less suitable for > > > > > enabling in our production environment: > > > > > - Both kubelet and containers need adaptation to support it. > > > > > - Dynamically altering security policies for individual containers > > > > > without interrupting their operations isn't straightforward. > > > > > > > > > > Future Considerations > > > > > ===================== > > > > > > > > > > In addition, there's room for enhancement in the OOM-killer for cases > > > > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to > > > > > prioritize selecting a victim that has allocated memory on the same NUMA > > > > > node. My exploration on the lore led me to a proposal[0] related to this > > > > > matter, although consensus seems elusive at this point. Nevertheless, > > > > > delving into this specific topic is beyond the scope of the current > > > > > patchset. > > > > > > > > > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@xxxxxxxxxxxxx/ > > > > > > > > > > Changes: > > > > > - v4 -> v5: > > > > > - Revise the commit log in patch #5. (KP) > > > > > - v3 -> v4: https://lwn.net/Articles/954126/ > > > > > - Drop the changes around security_task_movememory (Serge) > > > > > - RCC v2 -> v3: https://lwn.net/Articles/953526/ > > > > > - Add MPOL_F_NUMA_BALANCING man-page (Ying) > > > > > - Fix bpf selftests error reported by bot+bpf-ci > > > > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/ > > > > > - Refine the commit log to avoid misleading > > > > > - Use one common lsm hook instead and add comment for it > > > > > - Add selinux implementation > > > > > - Other improments in mempolicy > > > > > - RFC v1: https://lwn.net/Articles/951188/ > > > > > > > > > > Yafang Shao (5): > > > > > mm, doc: Add doc for MPOL_F_NUMA_BALANCING > > > > > mm: mempolicy: Revise comment regarding mempolicy mode flags > > > > > mm, security: Add lsm hook for memory policy adjustment > > > > > security: selinux: Implement set_mempolicy hook > > > > > selftests/bpf: Add selftests for set_mempolicy with a lsm prog > > > > > > > > > > .../admin-guide/mm/numa_memory_policy.rst | 27 +++++++ > > > > > include/linux/lsm_hook_defs.h | 3 + > > > > > include/linux/security.h | 9 +++ > > > > > include/uapi/linux/mempolicy.h | 2 +- > > > > > mm/mempolicy.c | 8 +++ > > > > > security/security.c | 13 ++++ > > > > > security/selinux/hooks.c | 8 +++ > > > > > security/selinux/include/classmap.h | 2 +- > > > > > .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++ > > > > > .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++ > > > > > 10 files changed, 182 insertions(+), 2 deletions(-) > > > > > create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c > > > > > create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c > > > > > > > > In your original patchset there was a lot of good discussion about > > > > ways to solve, or mitigate, this problem using existing mechanisms; > > > > while you disputed many (all?) of those suggestions, I felt that they > > > > still had merit over your objections. > > > > > > JFYI. The initial patchset presents three suggestions: > > > - Disabling CONFIG_NUMA, proposed by Michal: > > > By default, tasks on a server allocate memory from their local > > > memory node initially. Disabling CONFIG_NUMA could potentially lead to > > > a performance hit. > > > > > > - Adjusting NUMA workload configuration, also from Michal: > > > This adjustment has been successfully implemented on some dedicated > > > clusters, as mentioned in the commit log. However, applying this > > > change universally across a large fleet of servers might result in > > > significant wastage of physical memory. > > > > > > - Implementing seccomp, suggested by Ondrej and Casey: > > > As indicated in the commit log, altering the security policy > > > dynamically without interrupting a running container isn't > > > straightforward. Implementing seccomp requires the introduction of an > > > eBPF-based seccomp, which constitutes a substantial change. > > > [ The seccomp maintainer has been added to this mail thread for > > > further discussion. ] > > > > The seccomp filter runs cBFF (classic BPF) and not eBPF; there are a > > number of sandboxing tools designed to make this easier to use, > > including systemd, and if you need to augment your existing > > application there are libraries available to make this easier. > > Let's delve into how cBPF-based seccomp operates with runc [0] - our > application: > > 1. Create a seccomp filter in /path/to/seccomp/profile.json. > 2. Initiate a container with this filter rule using > docker run --rm \ > -it \ > --security-opt seccomp=/path/to/seccomp/profile.json \ > hello-world > > However, modifying or removing the seccomp filter mandates stopping > the running container and repeating the aforementioned steps. This > interruption isn't desirable for us. > > The inability to dynamically alter the seccomp filter with cBPF arises > from the kernel lacking a method to unload the seccomp once attached > to a task. In other words, cBPF-based seccomp cannot dynamically > attach and detach from tasks. Please correct me if my understanding is > incorrect. > > [0]. https://docs.docker.com/engine/security/seccomp/ Paul, Do you have any additional comments or further suggestions? -- Regards Yafang