Hi folks,

We found that the usage counter of containers using memory cgroup v1 is not consistent with the memory usage of their processes when THP is in use. The issue was introduced by upstream commit 0a31bc97c80 and still exists in Linux 5.14.5.

The root cause is that mem_cgroup_uncharge was moved to the final put_page(). When parts of a huge page are freed under THP, the memory usage of the process is updated when the PTEs are unmapped, but the usage counter of the memory cgroup is updated only when the huge page is split in deferred_split_scan. This causes the inconsistency, and we have seen differences of more than 30 GB in our daily usage.

It can be reproduced with the following program and script. The program, named "eat_memory_release", allocates memory in 8 MB regions and releases the last 1 MB of each region using madvise. The script "test_thp.sh" creates a memory cgroup, runs "eat_memory_release 500" in it, and repeats the procedure 10 times.

The output shows the change in memory usage, which in theory should drop by about 500 MB. With THP the output varies randomly, while adding "echo 2 > /proc/sys/vm/drop_caches" before accounting avoids this.

Is there a patch that fixes this, or is this behavior by design?

Thanks,
Yunfang Tai
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memset */
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

#define REGION_SIZE (8UL * 1024 * 1024)   /* each mapping is 8 MB */
#define FREE_SIZE   (1UL * 1024 * 1024)   /* last 1 MB of each mapping is released */

int main(int argc, char *argv[])
{
    char *memindex[1000] = {0};
    int eat = 0;
    int i = 0;

    if (argc < 2) {
        printf("Usage: ./eat_memory_release <num> #allocate num * 8 MB and free num MB memory\n");
        return 1;
    }
    sscanf(argv[1], "%d", &eat);
    if (eat <= 0 || eat >= 1000) {
        printf("num should be larger than 0 and less than 1000\n");
        return 1;
    }

    printf("Allocate memory in MB size: %d\n", eat * 8);
    printf("Allocation memory Begin!\n");
    for (i = 0; i < eat; i++) {
        memindex[i] = (char *)mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (memindex[i] == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* Touch every page so the region is actually backed by memory. */
        memset(memindex[i], 0, REGION_SIZE);
    }
    printf("Allocation memory Done!\n");

    sleep(2);
    printf("Now begin to madvise free memory!\n");
    for (i = 0; i < eat; i++) {
        /* Drop the last 1 MB of each 8 MB region. */
        madvise(memindex[i] + (REGION_SIZE - FREE_SIZE), FREE_SIZE, MADV_DONTNEED);
    }

    sleep(5);
    printf("Now begin to release memory!\n");
    for (i = 0; i < eat; i++) {
        munmap(memindex[i], REGION_SIZE);
    }
    return 0;
}
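For anyone who wants to compare the two counters directly, here is a rough sketch of helpers that could be used for the accounting step. It is not part of the attached script. The function names are hypothetical, and it assumes the process RSS is taken from the VmRSS line of /proc/<pid>/status and the cgroup usage from memory.usage_in_bytes under a cgroup v1 mount (commonly /sys/fs/cgroup/memory/<group>/, which may differ on your system).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expected resident memory in MB after the madvise step:
 * num regions of 8 MB each, with the last 1 MB of each region dropped. */
long expected_mb_after_madvise(long num)
{
    return num * 8 - num * 1;
}

/* Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.
 * Returns -1 if the field is missing or malformed. */
long parse_vmrss_kb(const char *status_text)
{
    const char *p;
    long kb;

    assert(status_text != NULL);
    p = strstr(status_text, "VmRSS:");
    if (p == NULL || sscanf(p, "VmRSS: %ld kB", &kb) != 1)
        return -1;
    return kb;
}

/* Read a single integer from a file such as memory.usage_in_bytes.
 * The path is supplied by the caller; returns -1 on any error. */
long long read_counter(const char *path)
{
    long long value;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For "eat_memory_release 500", expected_mb_after_madvise(500) gives 3500, i.e. 4000 MB allocated minus the 500 MB released by madvise. Sampling read_counter() on the cgroup's memory.usage_in_bytes and parse_vmrss_kb() on the process status text at the same moment should show whether the cgroup counter lags behind the process RSS.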
Attachment:
test_thp.sh
Description: Bourne shell script