On Tue, Aug 01, 2023 at 11:49:42AM -0700, Mike Kravetz wrote: > On 08/01/23 20:22, Xueshi Hu wrote: > > On Tue, Aug 1, 2023 at 6:17 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote: > > > > > > On 07/30/23 20:51, Xueshi Hu wrote: > > > > When writing to /proc/sys/vm/nr_huge_pages, it indicates global number of > > > > huge pages of the default hstate. But when reading from it, it indicates > > > > the current number of "persistent" huge pages in the kernel's huge page > > > > pool. > > > > > > > > There are currently four interfaces used to export the number of huge > > > > pages: > > > > - /proc/meminfo > > > > - /proc/sys/vm/*hugepages* > > > > - /sys/devices/system/node/node0/hugepages/hugepages-2048kB/* > > > > - /sys/kernel/mm/hugepages/hugepages-2048kB/* > > > > > > > > But only the /proc/sys/vm/nr_huge_pages provides the 'persistent' > > > > semantics when reading from it. This inconsistency is very subtle and can > > > > be easily misunderstood. > > > > > > Thanks for looking into this. > > > > > > The hugetlb documentation (./admin-guide/mm/hugetlbpage.rst) mentions > > > the term 'persistent hugetlb pages', but never provides a definition. > > > > > > We can get the definition from the code as: > > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > > > > > > Further, the documentation says: > > > "The ``/proc/meminfo`` file provides information about the total number of > > > persistent hugetlb pages in the kernel's huge page pool." > > > > > > "``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" > > > huge pages in the kernel's huge page pool." > > > > > > "The administrator may shrink the pool of persistent huge pages for > > > the default huge page size by setting the ``nr_hugepages`` sysctl to a > > > smaller value." > > > > > > So, the documentation implies that these interfaces should display the > > > number of persistent hugetlb pages. As you have discovered, all but the > > > sysctl interface (and /proc/sys/vm/nr_hugepages) displays the total > > > number of hugetlb pages rather than the number of persistent hugetlb > > > pages. > > > > > > If we wanted to match the documentation, it seems we should change all > > > the "show" interfaces to display persistent huge pages. However, I am a > > > bit concerned about how this may impact end users. > > > > > > There are two types if inconsistencies in these interfaces. > > > 1) As this patch points out, not all "show" interfaces provide the same > > > information. sysctl (/proc/sys/vm/nr_hugepages) displays the number > > > of persistent hugetlb pages, while the others display the total number > > > of hugetlb pages. > > > 2) The show/read interfaces generally provide the total number of > > > hugetlb pages, and the update/write interfaces update the number of > > > persistent hugetlb pages. > > > > > > Both of these situations can lead to user confusion. My 'guess' is that > > > this has not been a widespread issue as most hugetlb users do not > > > configure overcommit/surplus hugetlb pages and thus total number of > > > hugetlb pages is the same as number of persistent hugetlb pages. > > > > > > Right now, I would suggest making all these interfaces display/take the > > > number of persistent hugetlb pages for consistency. This also matches > > > the documentation. > > > > > > Thoughts? > > I am concerned that modifying it this way may result in an weaker control > > over hugetlb pages. Administrator will no longer be able to increase > > surplus pages through the nr_hugepages interface. > > > > Since surplus pages depend on the state of programs in the entire > > system, adjusting nr_hugepages may lead to an unexpected number of > > hugetlbs allocated which may leads to oom. > > Sorry, I am not sure I understand your concerns. I'm wrong, just ignore what I've said. > > Currently, the interfaces to set/update the number of hugetlb pages use > the supplied count as the number of requested persistent pages. I am > not suggesting any changes there (except the bug in node specific code > you discovered). Rather, I am suggesting that we update the interfaces > which show the number of hugepages (nr_hugepages) to display the number > of persistent pages to be consistent with the set/update interfaces. I agree with you. > > > About the definition of /proc/sys/vm/nr_huge_pages and meaning of > > "persistent", the documentation is kind of ambiguous. > > > > The documentation says: > > > > "The ``/proc/meminfo`` file provides information about the total number of > > persistent hugetlb pages in the kernel's huge page pool." > > > > "Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` > > such that it becomes less than the number of huge pages in use will > > convert the balance of the in-use huge pages to surplus huge pages." > > > > "The ``/proc`` interfaces discussed above have been retained for backwards > > compatibility." > > > > The ambiguities are: > > 1. HugePages_Total in /proc/meminfo is actually the total number of > > hugetlb pages. > > Correct. Although the documentation states it is the number of > persistent hugetlb pages. meminfo also contains the number of surplus > huge pages. So, it it possible that one could see > > HugePages_Total: 0 > HugePages_Surp: 100 It's easy to fix. > > Ideally, one would want to know the value for overcommit hugepages as > well. It will be straightforward to achieve this. > > The sysfs directories /sys/kernel/mm/hugepages/hugepages-*/ contain both > the surplus and overcommit counts. > > node specific sysfs directories only contain surplus counts. Node specific sysfs directories don't contain resv_hugepages too. After resolving this issue, I will attempt to assess the feasibility about how to implement node-specific reservations and overcommitment. > > > 2. If nr_hugepages means persistent hugetlb pages, converting in-use huge > > pages to surplus huge pages is impossible. > > I am not sure I understand. When writing to nr_hugepages today, it does > mean persistent hugetlb pages. Are you suggesting we change it to mean > total hugetlb pages when writing/updating? I do not think that is the > case, as none of your proposed changes do this. Still, I'm wrong. > > > 3. As you know, backward compatibility is not retained. > > > > Given that the document needs to be modified anyway, why not make the > > interface more user-friendly? > > In any case, I agree the document should be updated to match the code. > It should also define persistent hugetlb pages. Yes, I'll add it in the v2 patch. > > Thank you, > -- > Mike Kravetz