On 8/8/23 15:58, David Hildenbrand wrote:
On 08.08.23 04:28, Xueshi Hu wrote:
On 8/7/23 23:15, David Hildenbrand wrote:
On 06.08.23 09:48, Xueshi Hu wrote:
There are currently three 'nr_hugepages' used to export the number of
huge
pages:
1. /proc/sys/vm/nr_hugepages
2.
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
3. /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
For consistency, all three 'nr_hugepages' should return the total
number
of huge pages. When written, the number of persistent huge pages
will be
adjusted to the specified value.
But, /proc/sys/vm/nr_hugepages returns the number of persistent huge
pages.
But that's documented behavior, no?
Documentation/admin-guide/mm/hugetlbpage.rst
``/proc/sys/vm/nr_hugepages`` indicates the current number of
"persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge
pages
by increasing or decreasing the value of ``nr_hugepages``.
Actually, Documentation/admin-guide/mm/hugetlbpage.rst is contradictory
about the definition of /proc/sys/vm/nr_hugepages.
The documentation says:
- ``/proc/sys/vm/nr_hugepages`` indicates the current number of
"persistent" huge.
But, the documentation also says:
- The ``/proc`` interfaces discussed above have been retained for
backwards compatibility.
Yes. And why shouldn't the behavior of these toggles be retained for
backwards compatibility?
- The ``nr_hugepages`` attribute returns the total number of huge
pages on
the specified node. When this attribute is written, the number of
persistent huge pages on the parent node will be adjusted to the
specified
value, if sufficient resources exist, regardless of the task's mempolicy
or cpuset constraints.
That is part of the "/sys/devices/system/node/node[0-9]*/hugepages/"
docuemntation,
not "/proc/sys/vm/nr_hugepages" documentation.
It's in the "Per Node Hugepages Attributes" section.
The document states that compatibility has been maintained between
/sys/devices/system/node/node[0-9]*/hugepages/nr_hugepages and
/proc/sys/vm/nr_hugepages. But, /proc/sys/vm/nr_hugepages displays the
number of
persistent hugetlb pages,
while /sys/devices/system/node/node[0-9]*/hugepages/nr_hugepages shows total
number hugetlb pages.
For clearness, an example:
Prepare:
echo 100 > /proc/sys/vm/nr_hugepages
launch a program to reserve 100 hupatlb pages, then sleep
echo 0 > /proc/sys/vm/nr_hugepages
Check the result:
cat /proc/sys/vm/nr_hugepages
0
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
100
If the compatibility is maintained, they should return the same value,
0 or 100.
Also, check the /proc/meminfo:
grep "HugePages_" /proc/meminfo
HugePages_Total: 100
HugePages_Free: 100
HugePages_Rsvd: 100
HugePages_Surp: 100
The documentation says:
The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool.
As you see, HugePages_Total should the total number of hugetlb pages
instead of "total number of persistent hugetlb pages". This is one the
mistakes the documentation made.
If I am not wrong, that documentation -- including the usage of the
"persistent"
term -- were introduced in 2009 already:
commit 267b4c281b4a43c8f3d965c791d3a7fd62448733
Author: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
Date: Mon Dec 14 17:58:30 2009 -0800
hugetlb: update hugetlb documentation for NUMA controls
Update the kernel huge tlb documentation to describe the numa memory
policy based huge page management. Additionaly, the patch includes
a fair
amount of rework to improve consistency, eliminate duplication and
set the
context for documenting the memory policy interaction.
So this has been documented behavior for a long time.
So, I create the patch 4 to make the documentation more clear.
If such subtle inconsistencies result in unexpected behavior, it can be
challenging for a system administrator to detect.
Your patch is changing the traditional, documented behavior to then change
the documentation to match the new implementation.
I'm not intended to modify a correct api and then change the documentation,
that's totally time-wasting. The API can be easily misused and the
documentation is unclear, that's why I send the patchset.
How can you be sure nobody relies on exactly that traditional, well
documented
behavior?
Considering no special statement about the inconsistency, I think more
people will just misuse it instead of relying on the minor
inconsistency.
Why change the behavior of an interface that is kept for backwards
compatibility,
that might still be in use under the assumptions that the behavior is
exactly
like that (and for at least the last 14 years)?
Maybe, you mean "to keep backwards compatibility" ?
Thanks,
Hu