Re: [PATCH v6 4/4] mm: hugetlb_vmemmap: add hugetlb_free_vmemmap sysctl

Muchun Song <songmuchun@xxxxxxxxxxxxx> · Thu, 31 Mar 2022 11:45:29 +0800

On Thu, Mar 31, 2022 at 10:37 AM Andrew Morton
<akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 30 Mar 2022 23:37:45 +0800 Muchun Song <songmuchun@xxxxxxxxxxxxx> wrote:
>
> > We must add "hugetlb_free_vmemmap=on" to boot cmdline and reboot the
> > server to enable the feature of freeing vmemmap pages of HugeTLB
> > pages.  Rebooting usually takes a long time.  Add a sysctl to enable
> > or disable the feature at runtime without rebooting.
>
> I forget, why did we add the hugetlb_free_vmemmap option in the first
> place? Why not just leave the feature enabled in all cases?

The first reason is that the original version disabled PMD (huge page)
mapping of vmemmap pages, which increases the number of page table
pages.  So if a user/sysadmin uses only a small number of HugeTLB
pages (as a percentage of system memory), they could end up using
more memory with hugetlb_free_vmemmap on than with it off.  That
tradeoff is now gone.
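
To make the old tradeoff concrete, here is a rough back-of-the-envelope
sketch, not taken from the patch.  It assumes 4 KiB base pages, a
64-byte struct page and 2 MiB HugeTLB pages.  The number of vmemmap
pages freed per HugeTLB page is left as a parameter because it depends
on the kernel version, and the function names are made up for
illustration.  Upper-level page tables are ignored.

```python
PAGE = 4096             # base page size in bytes (assumption: 4 KiB)
STRUCT_PAGE = 64        # sizeof(struct page) in bytes (assumption)
HUGE = 2 << 20          # HugeTLB page size in bytes (assumption: 2 MiB)
PTE_SPAN = 512 * PAGE   # virtual range mapped by one 4 KiB PTE page


def vmemmap_pages_per_huge_page():
    """Number of base pages holding the struct pages of one HugeTLB page."""
    return (HUGE // PAGE) * STRUCT_PAGE // PAGE


def extra_page_table_bytes(ram_bytes):
    """PTE pages needed when all of vmemmap is mapped with base pages
    instead of PMDs (the behaviour of the original version)."""
    vmemmap_bytes = (ram_bytes // PAGE) * STRUCT_PAGE
    pte_pages = -(-vmemmap_bytes // PTE_SPAN)  # ceiling division
    return pte_pages * PAGE


def net_saving(ram_bytes, nr_huge_pages, freed_per_huge_page):
    """Bytes saved by the optimization minus the extra page table cost.
    A negative result means the feature costs more than it saves."""
    saved = nr_huge_pages * freed_per_huge_page * PAGE
    return saved - extra_page_table_bytes(ram_bytes)
```

Under these assumptions a 64 GiB machine pays about 2 MiB in extra PTE
pages.  If 6 vmemmap pages are freed per HugeTLB page, 10 HugeTLB pages
save only 240 KiB (a net loss), while 1024 HugeTLB pages save 24 MiB
(a net gain of 22 MiB).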

The second reason is that this feature adds overhead to the path of
HugeTLB allocation/freeing from/to the buddy system, as Mike said
in [1]:
"
There are still some instances where huge pages
are allocated 'on the fly' instead of being pulled from the pool.  Michal
pointed out the case of page migration.  It is also possible for someone to
use hugetlbfs without pre-allocating huge pages to the pool.  I remember the
use case pointed out in commit 099730d67417.  It says, "I have a hugetlbfs
user which is never explicitly allocating huge pages with 'nr_hugepages'.
They only set 'nr_overcommit_hugepages' and then let the pages be allocated
from the buddy allocator at fault time."  In this case, I suspect they were
using 'page fault' allocation for initialization much like someone using
/proc/sys/vm/nr_hugepages.  So, the overhead may not be as noticeable.
"

Because of these different workloads, we introduced hugetlb_free_vmemmap
and expect users to decide based on their own workload.

[1] https://patchwork.kernel.org/comment/23752641/

>
> Furthermore, why would anyone want to tweak this at runtime?  What is
> the use case?  Where is the end-user value in all of this?

If the workload on a server changes in the future, users need to be able
to adapt this setting at runtime without rebooting the server.
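
As a sketch of what such a runtime adjustment could look like, the
helper below follows the procedure from the patch description: try to
disable the feature, and if that fails, set "nr_hugepages" to 0 and
retry once.  The path /proc/sys/vm/hugetlb_free_vmemmap is an
assumption based on the sysctl being documented in vm.rst, and the
function name and return strings are made up for illustration.  The
directory is a parameter so the logic can be tried against ordinary
files.

```python
from pathlib import Path


def disable_vmemmap_optimization(sysctl_dir="/proc/sys/vm"):
    """Write 0 to hugetlb_free_vmemmap; on failure drain the pool and retry."""
    vm = Path(sysctl_dir)
    knob = vm / "hugetlb_free_vmemmap"   # assumed sysctl path
    try:
        knob.write_text("0\n")
        return "disabled"
    except OSError:
        # Disabling is refused while optimized HugeTLB pages exist, so
        # shrink the pool to zero first.  This releases free pool pages.
        (vm / "nr_hugepages").write_text("0\n")
        knob.write_text("0\n")           # may still fail; let it propagate
        return "disabled after draining pool"
```

Note that HugeTLB pages still in use are not released by setting
"nr_hugepages" to 0, so the retry can still fail and the error is
deliberately left to propagate to the caller.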

>
> > Disabling requires there is no any optimized HugeTLB page in the
> > system.  If you fail to disable it, you can set "nr_hugepages" to 0
> > and then retry.
> >
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -561,6 +561,20 @@ Change the minimum size of the hugepage pool.
> >  See Documentation/admin-guide/mm/hugetlbpage.rst
> >
> >
> > +hugetlb_free_vmemmap
> > +====================
> > +
> > +Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap
> > +pages associated with each HugeTLB page.  Once true, the vmemmap pages of
> > +subsequent allocation of HugeTLB pages from buddy system will be optimized,
> > +whereas already allocated HugeTLB pages will not be optimized.  If you fail
> > +to disable this feature, you can set "nr_hugepages" to 0 and then retry
> > +since it is only allowed to be disabled after there is no any optimized
> > +HugeTLB page in the system.
> > +
>
> Pity the poor user who is looking at this and wondering whether it will
> improve or worsen things.  If we don't tell them, who will?  Are they
> supposed to just experiment?
>
> What can we add here to help them understand whether this might be
> beneficial?
>

My bad. I should add more detail so that users can make better decisions.

Thanks.