On Sun, Sep 26, 2021 at 12:35 AM 台运方 <yunfangtai09@xxxxxxxxx> wrote:
>
> Hi folks,
> We found that the usage counter of containers using memory cgroup v1 is
> inconsistent with the memory usage of the processes when THP is in use.
>
> The problem was introduced by upstream commit 0a31bc97c80 and still
> exists in Linux 5.14.5. The root cause is that mem_cgroup_uncharge was
> moved to the final put_page(). When parts of a huge page are freed
> under THP, the memory usage of the process is updated when the PTEs are
> unmapped, but the usage counter of the memory cgroup is only updated
> when the huge page is split in deferred_split_scan. This causes the
> inconsistency, and we could see more than 30 GB of difference in our
> daily usage.

IMHO I don't think this is a bug. The disparity reflects the difference
in how the page life cycle is viewed by the process and by the cgroup.

The usage of the process comes from the rss_counter of the mm, which
tracks per-process mapped memory, so it is updated as soon as a page is
zapped. From the point of view of the cgroup, however, the page is
charged when it is allocated and uncharged when it is freed. A page may
be zapped by one process while other users still pin it and prevent it
from being freed. Such a pin may be very transient or may be
indefinite. THP is one of those pins: it goes away when the THP is
split, but due to deferred split that may happen long after the page is
zapped.

> It can be reproduced with the following program and script.
> The program, "eat_memory_release", allocates memory in 8 MB chunks and
> releases the last 1 MB of each chunk using madvise.
> The script "test_thp.sh" creates a memory cgroup, runs
> "eat_memory_release 500" in it, and repeats the procedure 10 times. The
> output shows the change in memory usage, which in theory should be
> about 500 MB lower.
> The outputs vary randomly when THP is enabled, while adding "echo 2 >
> /proc/sys/vm/drop_caches" before accounting avoids this.
>
> Are there any patches to fix this, or is it normal by design?
>
> Thanks,
> Yunfang Tai