Hello,

On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> Hello Roman,
>
> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>
> My results of the percpu_test are as follows:
> Intel KVM 4CPU:4G
> Vanilla 5.12-rc6
> # ./percpu_test.sh
> Percpu: 1952 kB
> Percpu: 219648 kB
> Percpu: 219648 kB
>
> 5.12-rc6 + with patchset applied
> # ./percpu_test.sh
> Percpu: 2080 kB
> Percpu: 219712 kB
> Percpu: 72672 kB
>
> I'm able to see improvement comparable to what you're seeing too.
>
> However, on POWERPC I'm unable to reproduce these improvements with the
> patchset in the same configuration.
>
> POWER9 KVM 4CPU:4G
> Vanilla 5.12-rc6
> # ./percpu_test.sh
> Percpu: 5888 kB
> Percpu: 118272 kB
> Percpu: 118272 kB
>
> 5.12-rc6 + with patchset applied
> # ./percpu_test.sh
> Percpu: 6144 kB
> Percpu: 119040 kB
> Percpu: 119040 kB
>
> I'm wondering if there's any architecture-specific code that needs plumbing
> here?
>

There shouldn't be. Can you send me the percpu_stats debug output before
and after?

> I will also look through the code to find the reason why POWER isn't
> depopulating pages.
>
> Thank you,
> Pratik
>
> On 08/04/21 9:27 am, Roman Gushchin wrote:
> > In our production experience the percpu memory allocator is sometimes struggling
> > with returning the memory to the system. A typical example is a creation of
> > several thousand memory cgroups (each has several chunks of the percpu data
> > used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
> > of these cgroups doesn't always lead to a shrinkage of the percpu memory,
> > so that sometimes there are several GB's of memory wasted.
> >
> > The underlying problem is the fragmentation: to release an underlying chunk
> > all percpu allocations should be released first. The percpu allocator tends
> > to top up chunks to improve the utilization. It means new small-ish allocations
> > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> > effectively pinning them in memory.
> >
> > This patchset solves this problem by implementing a partial depopulation
> > of percpu chunks: chunks with many empty pages are being asynchronously
> > depopulated and the pages are returned to the system.
> >
> > To illustrate the problem the following script can be used:
> >
> > --
> > #!/bin/bash
> >
> > cd /sys/fs/cgroup
> >
> > mkdir percpu_test
> > echo "+memory" > percpu_test/cgroup.subtree_control
> >
> > cat /proc/meminfo | grep Percpu
> >
> > for i in `seq 1 1000`; do
> >     mkdir percpu_test/cg_"${i}"
> >     for j in `seq 1 10`; do
> >         mkdir percpu_test/cg_"${i}"_"${j}"
> >     done
> > done
> >
> > cat /proc/meminfo | grep Percpu
> >
> > for i in `seq 1 1000`; do
> >     for j in `seq 1 10`; do
> >         rmdir percpu_test/cg_"${i}"_"${j}"
> >     done
> > done
> >
> > sleep 10
> >
> > cat /proc/meminfo | grep Percpu
> >
> > for i in `seq 1 1000`; do
> >     rmdir percpu_test/cg_"${i}"
> > done
> >
> > rmdir percpu_test
> > --
> >
> > It creates 11000 memory cgroups and removes every 10 out of 11.
> > It prints the initial size of the percpu memory, the size after
> > creating all cgroups and the size after deleting most of them.
> >
> > Results:
> > vanilla:
> > ./percpu_test.sh
> > Percpu: 7488 kB
> > Percpu: 481152 kB
> > Percpu: 481152 kB
> >
> > with this patchset applied:
> > ./percpu_test.sh
> > Percpu: 7488 kB
> > Percpu: 481408 kB
> > Percpu: 135552 kB
> >
> > So the total size of the percpu memory was reduced by more than 3.5 times.
> >
> > v3:
> >   - introduced pcpu_check_chunk_hint()
> >   - fixed a bug related to the hint check
> >   - minor cosmetic changes
> >   - s/pretends/fixes (cc Vlastimil)
> >
> > v2:
> >   - depopulated chunks are sidelined
> >   - depopulation happens in the reverse order
> >   - depopulate list made per-chunk type
> >   - better results due to better heuristics
> >
> > v1:
> >   - depopulation heuristics changed and optimized
> >   - chunks are put into a separate list, depopulation scan this list
> >   - chunk->isolated is introduced, chunk->depopulate is dropped
> >   - rearranged patches a bit
> >   - fixed a panic discovered by krobot
> >   - made pcpu_nr_empty_pop_pages per chunk type
> >   - minor fixes
> >
> > rfc:
> >   https://lwn.net/Articles/850508/
> >
> >
> > Roman Gushchin (6):
> >   percpu: fix a comment about the chunks ordering
> >   percpu: split __pcpu_balance_workfn()
> >   percpu: make pcpu_nr_empty_pop_pages per chunk type
> >   percpu: generalize pcpu_balance_populated()
> >   percpu: factor out pcpu_check_chunk_hint()
> >   percpu: implement partial chunk depopulation
> >
> >  mm/percpu-internal.h |   4 +-
> >  mm/percpu-stats.c    |   9 +-
> >  mm/percpu.c          | 306 +++++++++++++++++++++++++++++++++++--------
> >  3 files changed, 261 insertions(+), 58 deletions(-)
> >
>

Roman, sorry for the delay. I'm looking to apply this today to for-5.14.

Thanks,
Dennis
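
For anyone comparing runs, the reduction factors quoted in this thread can be
re-derived from the Percpu lines that percpu_test.sh prints. The following is
only a minimal sketch: percpu_kb and percpu_ratio are hypothetical helper names
that do not appear in the patchset or in the test script, and it assumes awk is
available and that the kernel exposes a Percpu field in /proc/meminfo.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the patchset or of percpu_test.sh.

# Print the Percpu value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
percpu_kb() {
    awk '/^Percpu:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Print peak/final as a ratio with two decimals, where "peak" is the
# size after creating the cgroups and "final" the size after removal.
percpu_ratio() {
    awk -v peak="$1" -v final="$2" 'BEGIN { printf "%.2f\n", peak / final }'
}

percpu_ratio 481152 135552   # cover letter numbers, patchset applied
percpu_ratio 219712 72672    # Intel KVM 4CPU:4G, patchset applied
percpu_ratio 119040 119040   # POWER9 KVM 4CPU:4G, patchset applied
```

With the numbers reported above this prints 3.55, 3.02 and 1.00, which matches
the "more than 3.5 times" figure in the cover letter and shows no depopulation
in the POWER9 run.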