Re: [PATCH 1/3] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 01, 2023 at 11:49:42AM -0700, Mike Kravetz wrote:
> On 08/01/23 20:22, Xueshi Hu wrote:
> > On Tue, Aug 1, 2023 at 6:17 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> > >
> > > On 07/30/23 20:51, Xueshi Hu wrote:
> > > > When writing to /proc/sys/vm/nr_huge_pages, it indicates global number of
> > > > huge pages of the default hstate. But when reading from it, it indicates
> > > > the current number of "persistent" huge pages in the kernel's huge page
> > > > pool.
> > > >
> > > > There are currently four interfaces used to export the number of huge
> > > > pages:
> > > > - /proc/meminfo
> > > > - /proc/sys/vm/*hugepages*
> > > > - /sys/devices/system/node/node0/hugepages/hugepages-2048kB/*
> > > > - /sys/kernel/mm/hugepages/hugepages-2048kB/*
> > > >
> > > > But only the /proc/sys/vm/nr_huge_pages provides the 'persistent'
> > > > semantics when reading from it. This inconsistency is very subtle and can
> > > > be easily misunderstood.
> > >
> > > Thanks for looking into this.
> > >
> > > The hugetlb documentation (./admin-guide/mm/hugetlbpage.rst) mentions
> > > the term 'persistent hugetlb pages', but never provides a definition.
> > >
> > > We can get the definition from the code as:
> > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > >
> > > Further, the documentation says:
> > > "The ``/proc/meminfo`` file provides information about the total number of
> > >  persistent hugetlb pages in the kernel's huge page pool."
> > >
> > > "``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent"
> > >  huge pages in the kernel's huge page pool."
> > >
> > > "The administrator may shrink the pool of persistent huge pages for
> > >  the default huge page size by setting the ``nr_hugepages`` sysctl to a
> > >  smaller value."
> > >
> > > So, the documentation implies that these interfaces should display the
> > > number of persistent hugetlb pages.  As you have discovered, all but the
> > > sysctl interface (and /proc/sys/vm/nr_hugepages) displays the total
> > > number of hugetlb pages rather than the number of persistent hugetlb
> > > pages.
> > >
> > > If we wanted to match the documentation, it seems we should change all
> > > the "show" interfaces to display persistent huge pages.  However, I am a
> > > bit concerned about how this may impact end users.
> > >
> > > There are two types if inconsistencies in these interfaces.
> > > 1) As this patch points out, not all "show" interfaces provide the same
> > >    information.  sysctl (/proc/sys/vm/nr_hugepages) displays the number
> > >    of persistent hugetlb pages, while the others display the total number
> > >    of hugetlb pages.
> > > 2) The show/read interfaces generally provide the total number of
> > >    hugetlb pages, and the update/write interfaces update the number of
> > >    persistent hugetlb pages.
> > >
> > > Both of these situations can lead to user confusion.  My 'guess' is that
> > > this has not been a widespread issue as most hugetlb users do not
> > > configure overcommit/surplus hugetlb pages and thus total number of
> > > hugetlb pages is the same as number of persistent hugetlb pages.
> > >
> > > Right now, I would suggest making all these interfaces display/take the
> > > number of persistent hugetlb pages for consistency.  This also matches
> > > the documentation.
> > >
> > > Thoughts?
> > I am concerned that modifying it this way may result in an weaker control
> > over hugetlb pages. Administrator will no longer be able to increase
> > surplus pages through the nr_hugepages interface.
> > 
> > Since surplus pages depend on the state of programs in the entire
> > system, adjusting nr_hugepages may lead to an unexpected number of
> > hugetlbs allocated which may leads to oom.
> 
> Sorry, I am not sure I understand your concerns.
I'm wrong, just ignore what I've said.
> 
> Currently, the interfaces to set/update the number of hugetlb pages use
> the supplied count as the number of requested persistent pages.  I am
> not suggesting any changes there (except the bug in node specific code
> you discovered).  Rather, I am suggesting that we update the interfaces
> which show the number of hugepages (nr_hugepages) to display the number
> of persistent pages to be consistent with the set/update interfaces.
I agree with you.
> 
> > About the definition of /proc/sys/vm/nr_huge_pages and meaning of
> > "persistent", the documentation is kind of ambiguous.
> > 
> > The documentation says:
> > 
> > "The ``/proc/meminfo`` file provides information about the total number of
> > persistent hugetlb pages in the kernel's huge page pool."
> > 
> > "Caveat: Shrinking the persistent huge page pool via ``nr_hugepages``
> > such that it becomes less than the number of huge pages in use will
> > convert the balance of the in-use huge pages to surplus huge pages."
> > 
> > "The ``/proc`` interfaces discussed above have been retained for backwards
> > compatibility."
> > 
> > The ambiguities are:
> > 1. HugePages_Total in /proc/meminfo is actually the total number of
> > hugetlb pages.
> 
> Correct.  Although the documentation states it is the number of
> persistent hugetlb pages.  meminfo also contains the number of surplus
> huge pages.  So, it it possible that one could see
> 
> HugePages_Total: 0
> HugePages_Surp:  100
It's easy to fix.
> 
> Ideally, one would want to know the value for overcommit hugepages as
> well.
It will be straightforward to achieve this.
> 
> The sysfs directories /sys/kernel/mm/hugepages/hugepages-*/ contain both
> the surplus and overcommit counts.
> 
> node specific sysfs directories only contain surplus counts.
Node specific sysfs directories don't contain resv_hugepages too.
After resolving this issue, I will attempt to assess the feasibility
about how to implement node-specific reservations and overcommitment.
> 
> > 2. If nr_hugepages means persistent hugetlb pages, converting in-use huge
> > pages to surplus huge pages is impossible.
> 
> I am not sure I understand.  When writing to nr_hugepages today, it does
> mean persistent hugetlb pages.  Are you suggesting we change it to mean
> total hugetlb pages when writing/updating?  I do not think that is the
> case, as none of your proposed changes do this.
Still, I'm wrong.
> 
> > 3. As you know, backward compatibility is not retained.
> > 
> > Given that the document needs to be modified anyway, why not make the
> > interface more user-friendly?
> 
> In any case, I agree the document should be updated to match the code.
> It should also define persistent hugetlb pages.
Yes, I'll add it in the v2 patch.
> 
> Thank you,
> -- 
> Mike Kravetz




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux