On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
>
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme. In several
> aspects. First and the foremost it is an independently managed resource
> with its own pool management, use and lifetime.

Honestly, the simpler explanation is that few people have used hugetlb
in regular, containerized non-HPC workloads.

Hugetlb has historically been much more special, and it retains a
specialness that warrants e.g. the hugetlb cgroup container. But it
has also made strides with hugetlb_cma, migratability, madvise support
etc. that allow much more on-demand use. It's no longer the case that
you just put a static pool of memory aside during boot and only a few
blessed applications are using it.

For example, we're using hugetlb_cma very broadly with generic
containers. The CMA region is fully usable by movable non-huge stuff
until huge pages are allocated in it. With the hugetlb controller you
can define a maximum number of hugetlb pages that can be used per
container. But what if that container isn't using any? Why shouldn't
it be allowed to use its overall memory allowance for anon and cache
instead?

With hugetlb being more dynamic, it becomes the same problem that we
had with dedicated tcp and kmem pools. It didn't make sense to fail a
random slab allocation when you still have memory headroom or can
reclaim some cache.
Nowadays, the same problem applies to hugetlb.

> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume considerable amount of memory. While
> this is the case for many kernel allocations as well they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.

I don't think that's true. For some workloads, network buffers can
absolutely dominate. And they work just fine with cgroup limits. It
isn't a problem that they aren't reclaimable themselves, it's just
important that they put pressure on stuff that is.

That way, if you use 80% hugetlb, the other memory is forced to stay
in the remaining 20%, or it OOMs; and if you don't use hugetlb, the
group is still allowed to use the full 100% of its host memory
allowance, without requiring some outside agent continuously
monitoring and adjusting the container limits.

> Another important aspect comes from the lifetime semantics when a proper
> reservations accounting and managing needs to handle mmap time rather
> than the usual allocation path. While pages are allocated they do not
> belong to anybody and only later at the #PF time (or read for the fs
> backed mapping) the ownership is established. That makes it really hard
> to manage memory as whole under the memcg anyway as a large part of
> that pool sits without an ownership yet it cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

Yeah, there is still a need for an actual hugetlb controller for the
static use cases (and even for dynamic access to hugetlb_cma).

But you need memcg coverage as well for the more dynamic cases to work
as expected. And having that doesn't really interfere with the static
use cases.

> Also, I will Nack involving hugetlb pages being accounted by
> default.
> This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Yes, no disagreement there. I think we're all on the same page that
this needs to be opt-in, say with a cgroup mount option.
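
To make the 80%/20% argument concrete, here is a toy userspace model of
the semantics being discussed. It is only a sketch, not kernel code:
the class ToyMemcg, its fields, and the hugetlb_accounting switch are
hypothetical names invented for illustration, anon is assumed to be
unreclaimable (no swap), and "reclaim" simply drops cache.

```python
from dataclasses import dataclass


class OutOfMemory(Exception):
    """Raised when a charge does not fit even after dropping all cache."""


@dataclass
class ToyMemcg:
    # All names here are hypothetical; units are arbitrary.
    limit: int                        # overall memory allowance of the group
    hugetlb_accounting: bool = False  # stand-in for the opt-in discussed above
    hugetlb: int = 0                  # not reclaimable
    anon: int = 0                     # assumed unreclaimable (no swap)
    cache: int = 0                    # reclaimable

    def charged(self) -> int:
        """Memory that counts against the limit."""
        total = self.anon + self.cache
        if self.hugetlb_accounting:
            total += self.hugetlb
        return total

    def footprint(self) -> int:
        """Memory the group really occupies, accounted or not."""
        return self.anon + self.cache + self.hugetlb

    def charge(self, kind: str, size: int) -> None:
        if kind not in ("hugetlb", "anon", "cache"):
            raise ValueError(f"unknown memory kind: {kind}")
        accounted = kind != "hugetlb" or self.hugetlb_accounting
        over = self.charged() + size - self.limit if accounted else 0
        if over > 0:
            # Pressure lands on reclaimable memory first.
            reclaimed = min(over, self.cache)
            self.cache -= reclaimed
            if reclaimed < over:
                raise OutOfMemory(f"{kind} charge of {size} exceeds limit")
        setattr(self, kind, getattr(self, kind) + size)


if __name__ == "__main__":
    group = ToyMemcg(limit=100, hugetlb_accounting=True)
    group.charge("cache", 50)
    group.charge("hugetlb", 80)  # pushes cache down to fit
    print(group)
```

With the switch off, the same group can charge 80 units of hugetlb and
still fill its whole limit with anon, so its real footprint exceeds
what the limit suggests. With the switch on and no hugetlb in use, the
full limit remains available for anon and cache.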