On Mon, Nov 13, 2023 at 4:50 PM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote:
>
> On Mon, Nov 13, 2023 at 4:17 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > tasks (oom_score_adj: -998) aggravating the issue.
> > >
> > > Is there some reason why you can't fix the callers of mbind(2)?
> > > This looks like an user space configuration error rather than a
> > > system security issue.
> >
> > It appears my initial description may have caused confusion. In this
> > scenario, the caller is an unprivileged user lacking any capabilities.
> > While a privileged user, such as root, experiencing this issue might
> > indicate a user space configuration error, the concerning aspect is
> > the potential for an unprivileged user to disrupt the system easily.
> > If this is perceived as a misconfiguration, the question arises: What
> > is the correct configuration to prevent an unprivileged user from
> > utilizing mbind(2)?
> >
> > >
> > > >
> > > > The selected victim might not have allocated memory on the same NUMA node,
> > > > rendering the killing ineffective. This patch aims to address this by
> > > > disabling MPOL_BIND in container environments.
> > > >
> > > > In the container environment, our aim is to consolidate memory resource
> > > > control under the management of kubelet. If users express a preference for
> > > > binding their memory to a specific NUMA node, we encourage the adoption of
> > > > a standardized approach. Specifically, we recommend configuring this memory
> > > > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > > > than individual users setting it autonomously. This centralized approach
> > > > ensures that NUMA nodes are globally managed through kubelet, promoting
> > > > consistency and facilitating streamlined administration of memory resources
> > > > across the entire containerized environment.
> > >
> > > Changing system behavior for a single use case doesn't seem prudent.
> > > You're introducing a bunch of kernel code to avoid fixing a broken
> > > user space configuration.
> >
> > Currently, there is no mechanism in place to proactively prevent an
> > unprivileged user from utilizing mbind(2). The approach adopted is to
> > monitor mbind(2) through a BPF program and trigger an alert if its
> > usage is detected. However, beyond this monitoring, the only recourse
> > is to verbally communicate with the user, advising against the use of
> > mbind(2). As a result, users will question why mbind(2) isn't outright
> > prohibited in the first place.
>
> Is there a reason why you can't use syscall filtering via seccomp(2)?
> AFAIK, all the mainstream container tooling already has support for
> specifying seccomp filters for containers.

seccomp is relatively heavyweight, making it less suitable for
enabling in our production environment. In contrast, LSMs offer a more
lightweight and flexible alternative. Moreover, the act of binding to
a specific NUMA node appears akin to a privileged operation,
warranting the consideration of a dedicated LSM hook.

--
Regards
Yafang
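
For reference, the seccomp(2) filtering raised in the quoted question could look roughly like the sketch below. It is only an illustration, not something taken from the patch or from any container runtime: the helper names deny_mbind() and mbind_is_denied() are made up here, and the filter skips the architecture check that a production filter should perform.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: install a seccomp filter that makes mbind(2)
 * fail with EPERM and allows every other syscall.
 * Returns 0 on success, -1 on error.
 *
 * Simplification: seccomp_data.arch is not checked, so syscalls made
 * through a secondary ABI (for example 32-bit compat) are not covered.
 */
static int deny_mbind(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number into the accumulator. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* mbind: fall through to the ERRNO return; else skip it. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mbind, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Needed so a task without CAP_SYS_ADMIN may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Hypothetical helper: issue a raw, harmless mbind(2) call and report
 * whether it was rejected with EPERM (1) or not (0).
 */
static int mbind_is_denied(void)
{
	errno = 0;
	long ret = syscall(__NR_mbind, NULL, 0UL, 0, NULL, 0UL, 0U);

	return ret == -1 && errno == EPERM;
}
```

A filter installed this way is inherited across fork(2) and execve(2), so a runtime would typically install it just before starting the container's workload. It also runs on every syscall the task makes, which is the overhead referred to above.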