On Sun, Sep 26, 2021 at 12:35 AM 台运方 <yunfangtai09@xxxxxxxxx> wrote:
>
> Hi folks,
> We found that the usage counter of containers using memory cgroup v1 is
> inconsistent with the memory usage of the processes when THP is in use.
>
> The problem was introduced by upstream commit 0a31bc97c80 and still
> exists in Linux 5.14.5. The root cause is that mem_cgroup_uncharge was
> moved to the final put_page(). When parts of a huge page are freed
> under THP, the memory usage of the process is updated when the PTEs are
> unmapped, but the usage counter of the memory cgroup is only updated
> when the huge page is split in deferred_split_scan. This causes the
> inconsistency, and we could see more than 30 GB of difference in our
> daily usage.

IMHO I don't think this is a bug. The disparity reflects the difference
in how the page life cycle is viewed by the process and by the cgroup.

The usage of the process comes from the rss_counter of the mm, which
tracks per-process mapped memory, so it is updated as soon as a page is
zapped. From the point of view of the cgroup, however, the page is
charged when it is allocated and uncharged when it is freed. A page may
be zapped by one process while other users still pin it and prevent it
from being freed. Such a pin may be very transient or may be
indefinite. THP is one of those pins: it goes away when the THP is
split, but due to deferred split that may happen long after the page is
zapped.

> It can be reproduced with the following program and script.
> The program, "eat_memory_release", allocates memory in 8 MB chunks and
> releases the last 1 MB of each chunk using madvise.
> The script "test_thp.sh" creates a memory cgroup, runs
> "eat_memory_release 500" in it, and repeats the procedure 10 times. The
> output shows the change in memory usage, which in theory should be
> about 500 MB lower.
> The outputs vary randomly when THP is enabled, while adding "echo 2 >
> /proc/sys/vm/drop_caches" before accounting avoids this.
>
> Are there any patches to fix this, or is it normal by design?
>
> Thanks,
> Yunfang Tai