On Thu, Mar 31, 2022 at 10:37 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 30 Mar 2022 23:37:45 +0800 Muchun Song <songmuchun@xxxxxxxxxxxxx> wrote:
>
> > We must add "hugetlb_free_vmemmap=on" to boot cmdline and reboot the
> > server to enable the feature of freeing vmemmap pages of HugeTLB
> > pages. Rebooting usually takes a long time. Add a sysctl to enable
> > or disable the feature at runtime without rebooting.
>
> I forget, why did we add the hugetlb_free_vmemmap option in the first
> place? Why not just leave the feature enabled in all cases?

The first reason is that the original version disabled PMD/huge page
mapping of vmemmap pages, which increases the number of page table
pages. So if a user/sysadmin only uses a small number of HugeTLB pages
(as a percentage of system memory), they could end up using more memory
with hugetlb_free_vmemmap on as opposed to off. Now this tradeoff is
gone.

The second reason is that this feature adds more overhead in the path
of HugeTLB allocation/freeing from/to the buddy system, as Mike said
in [1]:

"
There are still some instances where huge pages are allocated 'on the
fly' instead of being pulled from the pool. Michal pointed out the case
of page migration. It is also possible for someone to use hugetlbfs
without pre-allocating huge pages to the pool. I remember the use case
pointed out in commit 099730d67417. It says, "I have a hugetlbfs user
which is never explicitly allocating huge pages with 'nr_hugepages'.
They only set 'nr_overcommit_hugepages' and then let the pages be
allocated from the buddy allocator at fault time." In this case, I
suspect they were using 'page fault' allocation for initialization much
like someone using /proc/sys/vm/nr_hugepages. So, the overhead may not
be as noticeable.
"

For those different workloads, we introduced hugetlb_free_vmemmap and
expect users to make decisions based on their workloads.

[1] https://patchwork.kernel.org/comment/23752641/

>
> Furthermore, why would anyone want to tweak this at runtime? What is
> the use case? Where is the end-user value in all of this?

If the workload on a server changes in the future, users need to be
able to adapt this setting at runtime without rebooting the server.

>
> > Disabling requires there is no any optimized HugeTLB page in the
> > system. If you fail to disable it, you can set "nr_hugepages" to 0
> > and then retry.
> >
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -561,6 +561,20 @@ Change the minimum size of the hugepage pool.
> > See Documentation/admin-guide/mm/hugetlbpage.rst
> >
> >
> > +hugetlb_free_vmemmap
> > +====================
> > +
> > +Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap
> > +pages associated with each HugeTLB page. Once true, the vmemmap pages of
> > +subsequent allocation of HugeTLB pages from buddy system will be optimized,
> > +whereas already allocated HugeTLB pages will not be optimized. If you fail
> > +to disable this feature, you can set "nr_hugepages" to 0 and then retry
> > +since it is only allowed to be disabled after there is no any optimized
> > +HugeTLB page in the system.
> > +
>
> Pity the poor user who is looking at this and wondering whether it will
> improve or worsen things. If we don't tell them, who will? Are they
> supposed to just experiment?
>
> What can we add here to help them understand whether this might be
> beneficial?
>

My bad. I should explain this in more detail so that users can make
better decisions.

Thanks.
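
For illustration only, the disable-and-retry procedure described in the
quoted documentation could be scripted roughly as below. This is a minimal
sketch, not part of the patch: it assumes the sysctl is exposed as
/proc/sys/vm/hugetlb_free_vmemmap, as the proposed vm.rst entry suggests.
The helper name set_hugetlb_free_vmemmap and its vm_dir parameter are
hypothetical and exist only so the logic can be tried against an ordinary
directory.

```python
import os


def set_hugetlb_free_vmemmap(enable, vm_dir="/proc/sys/vm"):
    """Toggle the proposed sysctl; return True on success, False otherwise.

    Assumption: the knob lives at <vm_dir>/hugetlb_free_vmemmap, as in the
    patch under discussion. Writing to the real files requires root.
    """
    knob = os.path.join(vm_dir, "hugetlb_free_vmemmap")
    pool = os.path.join(vm_dir, "nr_hugepages")

    def write(path, value):
        # Errors from a rejected write surface as OSError when the file
        # is flushed or closed at the end of the "with" block.
        with open(path, "w") as f:
            f.write(f"{value}\n")

    try:
        write(knob, 1 if enable else 0)
        return True
    except OSError:
        if enable:
            return False

    # Disabling failed. Per the quoted documentation, shrink the pool to 0
    # and retry once. Huge pages that are still in use are not freed by
    # this, so the retry can fail as well.
    try:
        write(pool, 0)
        write(knob, 0)
        return True
    except OSError:
        return False
```

Calling set_hugetlb_free_vmemmap(False) as root would attempt the disable
path. Note that setting nr_hugepages to 0 releases the whole free pool, so
this is only reasonable when no workload depends on those pages.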