On Wed, Nov 11, 2020 at 3:23 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
> Thanks for continuing to work on this, Muchun!
>
> On 11/8/20 6:10 AM, Muchun Song wrote:
> ...
> > For tail pages, the value of compound_head is the same. So we can reuse
> > the first page of the tail page structs. We map the virtual addresses of
> > the remaining 6 pages of tail page structs to the first tail page struct,
> > and then free these 6 pages. Therefore, we need to reserve at least 2
> > pages as vmemmap areas.
> >
> > When a hugetlb page is freed to the buddy system, we should allocate six
> > pages for vmemmap pages and restore the previous mapping relationship.
> >
> > If we use 1GB hugetlb pages, we can save 4095 pages. This is a very
> > substantial gain.
>
> Is that 4095 number accurate? Are we not using two pages of struct pages
> as in the 2MB case?

Oh, yes, it should be 4094, and we should also subtract the page table
pages. For a 1GB HugeTLB page, the real saving is 4086 pages. Thanks for
pointing out this problem.
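For the record, here is the back-of-the-envelope calculation behind 4086
(assuming x86-64 with 4KB base pages and a 64-byte struct page; other
configurations will give different numbers):

    1GB / 4KB         = 262144 base pages, so 262144 struct pages
    262144 * 64B      = 16MB of vmemmap = 4096 pages
    4096 - 2 reserved = 4094 vmemmap pages freed
    16MB / 2MB        = 8 PMD mappings to split, costing 8 PTE pages
    4094 - 8          = 4086 pages saved per 1GB HugeTLB page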
> Also, because we are splitting the huge page mappings in the vmemmap,
> additional PTE pages will need to be allocated. Therefore, some additional
> page table pages may need to be allocated so that we can free the pages
> of struct pages. The net savings may be less than what is stated above.
>
> Perhaps this should mention that allocation of additional page table pages
> may be required?

Yeah, you are right. In a later patch, I will rework the analysis here to
make it more clear and accurate.

> ...
> > Because the vmemmap page tables are rebuilt on the allocating/freeing
> > path, this adds some overhead. Here is an analysis of that overhead.
> >
> > 1) Allocating 10240 2MB hugetlb pages.
> >
> > a) With this patch series applied:
> > # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> > real    0m0.166s
> > user    0m0.000s
> > sys     0m0.166s
> >
> > # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16K, 32K)          1868 |@@@@@@@@@@@                                          |
> > [32K, 64K)            10 |                                                     |
> > [64K, 128K)            2 |                                                     |
> >
> > b) Without this patch series:
> > # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> > real    0m0.066s
> > user    0m0.000s
> > sys     0m0.066s
> >
> > # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [8K, 16K)             62 |                                                     |
> > [16K, 32K)             2 |                                                     |
> >
> > Summary: with this feature, allocation is about ~2x slower than before.
> >
> > 2) Freeing 10240 2MB hugetlb pages.
> >
> > a) With this patch series applied:
> > # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> > real    0m0.004s
> > user    0m0.000s
> > sys     0m0.002s
> >
> > # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > b) Without this patch series:
> > # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> > real    0m0.077s
> > user    0m0.001s
> > sys     0m0.075s
> >
> > # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [8K, 16K)            287 |@                                                    |
> > [16K, 32K)             3 |                                                     |
> >
> > Summary: __free_hugepage is about ~2-4x slower than before. But based on
> > the allocation test above, I think it is really also about ~2x slower.
> >
> > So why is the 'real' time smaller with the patches applied? Because in
> > this patch series, freeing hugetlb pages is asynchronous (done via a
> > kworker).
> >
> > Although the overhead has increased, it is not incurred on every
> > allocation/free of a hugetlb page; it is paid only once, when we reserve
> > hugetlb pages through /proc/sys/vm/nr_hugepages. Once the reservation
> > succeeds, subsequent allocating, freeing, and use are the same as before
> > (unpatched). So I think the overhead is acceptable.
>
> Thank you for benchmarking. There are still some instances where huge pages
> are allocated 'on the fly' instead of being pulled from the pool. Michal
> pointed out the case of page migration. It is also possible for someone to
> use hugetlbfs without pre-allocating huge pages to the pool. I remember the
> use case pointed out in commit 099730d67417. It says, "I have a hugetlbfs
> user which is never explicitly allocating huge pages with 'nr_hugepages'.
> They only set 'nr_overcommit_hugepages' and then let the pages be allocated
> from the buddy allocator at fault time." In this case, I suspect they were
> using 'page fault' allocation for initialization much like someone using
> /proc/sys/vm/nr_hugepages. So, the overhead may not be as noticeable.

Thanks for pointing out this use case.

> --
> Mike Kravetz

--
Yours,
Muchun