On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> > [...]
> > > The fact that memory consumed by hugetlb is currently not considered
> > > inside memcg (host memory accounting and control) is inconsistent. It
> > > has been quite confusing to our service owners and complicating things
> > > for our containers team.
> >
> > I do understand how that is confusing and inconsistent as well. Hugetlb
> > has been bringing confusion throughout its existence, I am afraid.
> >
> > As noted in the other reply though, I am not sure the hugetlb pool can
> > be reasonably incorporated with sane semantics. Neither the regular
> > allocation nor the hugetlb reservation/actual use can fall back to the
> > pool of the other. This makes them two different things, each hitting
> > its own failure cases that require dedicated handling.
> >
> > Just off the top of my head, these are cases I do not see an easy way
> > out of:
> > - hugetlb charge failure has two failure modes - pool empty
> >   or memcg limit reached. The former is not recoverable and
> >   should fail without any further intervention; the latter might
> >   benefit from reclaiming.
> > - !hugetlb memory charge failure cannot consider any hugetlb
> >   pages - they are implicit memory.min protection, so it is
> >   impossible to manage reclaim protection without having
> >   knowledge of the hugetlb use.
> > - there is no way to control the hugetlb pool distribution by
> >   memcg limits. How do we distinguish reservations from actual
> >   use?
> > - the pre-allocated pool is consuming memory without any actual
> >   owner until it is actually used, and even that has two stages
> >   (reserved and really used). This makes it really hard to
> >   manage memory as a whole when there is a considerable amount
> >   of hugetlb memory preallocated.
>
> It's important to distinguish hugetlb access policy from memory use
> policy. This patch isn't about hugetlb access, it's about general
> memory use.
>
> Hugetlb access policy is a separate domain with separate
> answers. Preallocating is a privileged operation; for access control
> there is the hugetlb cgroup controller etc.
>
> What's missing is that once you get past the access restrictions and
> legitimately get your hands on huge pages, that memory use gets
> reflected in memory.current and exerts pressure on *other* memory
> inside the group, such as anon or optimistic cache allocations.
>
> Note that hugetlb *can* be allocated on demand. It's unexpected that
> when an application optimistically allocates a couple of 2M hugetlb
> pages, those aren't reflected in its memory.current. The same is true
> for hugetlb_cma. If the gigantic pages aren't currently allocated to a
> cgroup, that CMA memory can be used for movable memory elsewhere.
>
> The points you and Frank raise are reasons and scenarios where
> additional hugetlb access control is necessary - preallocation,
> limited availability of 1G pages etc. But they're not reasons against
> charging faulted-in hugetlb to the memcg *as well*.
>
> My point is we need both. One to manage competition over hugetlb,
> because it has unique limitations. The other to manage competition
> over host memory, which hugetlb is a part of.
>
> Here is a usecase from our fleet.
>
> Imagine a configuration with two 32G containers. The machine is booted
> with hugetlb_cma=6G, and each container may or may not use up to 3
> gigantic pages, depending on the workload within it. The rest is anon,
> cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
> 3G to enforce hugetlb fairness. But how do you configure memory.max to
> keep *overall* consumption, including anon, cache, slab etc., fair?
>
> If used hugetlb is charged, you can just set memory.max=32G regardless
> of the workload inside.
>
> Without it, you'd have to constantly poll hugetlb usage and readjust
> memory.max!

Yep, and I'd like to add that this can and has caused issues in our
production system when there is a delay in correcting the memory limits
(low or max). The userspace agent in charge of correcting them only runs
periodically, and between consecutive runs the system can be in an over-
or under-protected state. An instantaneous charge towards the memory
controller would close this gap. I think we need both the HugeTLB
controller and memory controller accounting.
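
To make that gap concrete, here is a minimal sketch of the kind of
periodic correction loop that is needed today because used hugetlb is not
charged to memory.current. This is not our actual agent; the cgroup path,
the 1GB page size, the 32G target and the poll interval are made-up
values for illustration, assuming a cgroup v2 hierarchy mounted at
/sys/fs/cgroup with the hugetlb and memory controllers enabled:

# Sketch of an agent folding hugetlb usage back into memory.max by hand.
# All names and numbers below are illustrative assumptions, not our
# production configuration.

import time

CGROUP = "/sys/fs/cgroup/container1"   # hypothetical container cgroup
TOTAL_LIMIT = 32 * 1024 ** 3           # intended overall cap: 32G

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def write_val(path, value):
    with open(path, "w") as f:
        f.write(str(value))

while True:
    # Bytes of 1GB hugetlb pages held by the container; without this
    # series they are invisible to memory.current.
    hugetlb = read_int(f"{CGROUP}/hugetlb.1GB.current")
    # Shrink the memcg limit so anon + cache + slab + hugetlb stays
    # within the intended 32G envelope.
    write_val(f"{CGROUP}/memory.max", TOTAL_LIMIT - hugetlb)
    # Between two iterations the container is over- or under-protected,
    # which is exactly the window an instantaneous memcg charge closes.
    time.sleep(5)

With used hugetlb charged to the memcg, none of this is needed:
memory.max=32G is set once and stays correct no matter how many gigantic
pages the workload faults in or releases.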