In commit eef1b3ba053a ("thp: implement split_huge_pmd()"):

  "Original split_huge_page() combined two operations: splitting
  PMDs into tables of PTEs and splitting underlying compound page.
  This patch implements split_huge_pmd() which split given PMD
  without splitting other PMDs this page mapped with or underlying
  compound page."

In this situation, suppose a process is allocated a large number of
transparent huge pages and later releases part of their memory. The
memory charged to the process decreases after split_huge_pmd(), but
the free memory of the system may not increase, because the
underlying compound pages have not been split. In addition, the rss
in the memory.stat of the cgroup the process belongs to is much
larger than expected. This causes some problems:

- Users cannot get the exact amount of free memory when evaluating
  the system's workload.
- The memory usage of a service becomes unstable due to
  unpredictable partial unmaps of transparent huge pages, and we
  cannot tell whether there is a memory leak or some other problem.
  (A userspace reproducer is sketched at the end of this mail.)

Here is an example:

# cat memory.stat
...
rss 297230336
rss_huge 230686720
...

# echo 2 > /proc/sys/vm/drop_caches
(this can split some transparent huge pages)

# cat memory.stat
...
rss 118128640
rss_huge 27262976
...

As memory.stat shows, the memory usage reported before the huge
pages are split is more than twice the actual memory usage.

Two possible solutions:

- Provide split_huge_page_pmd() again and add a sysfs interface for
  users to choose between split_huge_page_pmd() and split_huge_pmd()
  when releasing memory of transparent huge pages.
- Add a statistics item to /proc/meminfo and the memory cgroup that
  shows how much memory has been released by partial unmaps, so that
  users can calculate the actual free memory of the current system
  (a rough sketch of this option follows below).

I haven't implemented the patch yet. Hope there's a better solution.
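
For reference, here is a minimal userspace reproducer (a sketch, not
a finished test: it assumes a 2 MiB PMD size, THP enabled in "always"
or "madvise" mode, and simplified /proc parsing). It faults in
THP-backed memory and then unmaps the second half of every huge page:
the process's AnonHugePages and VmRSS drop, but the compound pages
are only queued for deferred splitting, so system-wide MemFree does
not grow by the same amount until the shrinkers run (e.g. via
echo 2 > /proc/sys/vm/drop_caches, as in the example above).

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE	(2UL << 20)	/* assumed PMD size: 2 MiB */
#define LEN	(64 * HPAGE)	/* 64 huge pages */

/* Sum all "<key>: ... kB" lines in a /proc file. */
static long read_kb(const char *path, const char *key)
{
	FILE *f = fopen(path, "r");
	char line[256];
	long kb, total = 0;
	size_t klen = strlen(key);

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, key, klen) &&
		    sscanf(line + klen, ": %ld kB", &kb) == 1)
			total += kb;
	fclose(f);
	return total;
}

int main(void)
{
	/* Over-allocate by one huge page so we can align to HPAGE,
	 * since mmap() only guarantees base-page alignment. */
	char *raw = mmap(NULL, LEN + HPAGE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (raw == MAP_FAILED)
		return 1;
	p = (char *)(((uintptr_t)raw + HPAGE - 1) & ~(HPAGE - 1));

	madvise(p, LEN, MADV_HUGEPAGE);	/* ask for THPs */
	memset(p, 1, LEN);		/* fault everything in */
	printf("before: AnonHugePages %ld kB, VmRSS %ld kB\n",
	       read_kb("/proc/self/smaps", "AnonHugePages"),
	       read_kb("/proc/self/status", "VmRSS"));

	/*
	 * Unmap the second half of every huge page.  split_huge_pmd()
	 * splits the PMDs, so VmRSS drops by about half, but the
	 * compound pages are only queued for deferred splitting:
	 * system-wide MemFree does not grow correspondingly yet.
	 */
	for (size_t off = HPAGE / 2; off < LEN; off += HPAGE)
		munmap(p + off, HPAGE / 2);

	printf("after:  AnonHugePages %ld kB, VmRSS %ld kB\n",
	       read_kb("/proc/self/smaps", "AnonHugePages"),
	       read_kb("/proc/self/status", "VmRSS"));
	return 0;
}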
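
For the second option, a very rough sketch of what the accounting
might look like, modeled on the existing per-node stat machinery
(mod_node_page_state() and friends). NR_ANON_PARTIAL_THPS and the
meminfo field name are hypothetical, the deltas are approximate (an
exact figure would need to track how much of each queued THP is
still mapped), and this is illustration rather than a working patch:

/*
 * Hypothetical sketch of option 2.  NR_ANON_PARTIAL_THPS is an
 * invented name; the hook mirrors where the kernel already queues
 * partially unmapped THPs for deferred splitting.
 */

/* include/linux/mmzone.h: a new per-node counter (hypothetical) */
enum node_stat_item {
	/* ... existing items ... */
	NR_ANON_PARTIAL_THPS,	/* THPs waiting for deferred split */
};

/* mm/huge_memory.c: when a partially unmapped THP is queued ... */
void deferred_split_huge_page(struct page *page)
{
	/* ... existing queueing logic, then: */
	mod_node_page_state(page_pgdat(page), NR_ANON_PARTIAL_THPS,
			    HPAGE_PMD_NR);
}

/*
 * ... and subtract HPAGE_PMD_NR again when the page is actually
 * split or freed, so that a "PartialTHP:" line in /proc/meminfo
 * (and a matching memcg stat in memory.stat) would tell users
 * roughly how much memory a deferred split could still give back.
 */

With such a counter, the "actual" free memory could be estimated as
MemFree plus the partially unmapped amount, instead of guessing from
the gap between rss and rss_huge.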