Re: [PATCH v2 1/4] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages

Xueshi Hu <xueshi.hu@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Tue, 8 Aug 2023 17:13:11 +0800

On 8/8/23 15:58, David Hildenbrand wrote:
On 08.08.23 04:28, Xueshi Hu wrote:
On 8/7/23 23:15, David Hildenbrand wrote:
On 06.08.23 09:48, Xueshi Hu wrote:
There are currently three 'nr_hugepages' used to export the number of
huge
pages:
1. /proc/sys/vm/nr_hugepages
2. 
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
3. /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

For consistency, all three 'nr_hugepages' should return the total 
number
of huge pages. When written, the number of persistent huge pages 
will be
adjusted to the specified value.

But, /proc/sys/vm/nr_hugepages returns the number of persistent huge
pages.

But that's documented behavior, no?

Documentation/admin-guide/mm/hugetlbpage.rst

``/proc/sys/vm/nr_hugepages`` indicates the current number of
"persistent" huge
pages in the kernel's huge page pool.  "Persistent" huge pages will be
returned to the huge page pool when freed by a task.  A user with root
privileges can dynamically allocate more or free some persistent huge 
pages
by increasing or decreasing the value of ``nr_hugepages``.

Actually, Documentation/admin-guide/mm/hugetlbpage.rst is contradictory
about the definition of /proc/sys/vm/nr_hugepages.

The documentation says:

- ``/proc/sys/vm/nr_hugepages`` indicates the current number of
"persistent" huge.

But, the documentation also says:

- The ``/proc`` interfaces discussed above have been retained for
backwards compatibility.

Yes. And why shouldn't the behavior of these toggles be retained for 
backwards compatibility?

- The ``nr_hugepages`` attribute returns the total number of huge 
pages on
the specified node.  When this attribute is written, the number of
persistent huge pages on the parent node will be adjusted to the 
specified
value, if sufficient resources exist, regardless of the task's mempolicy
or cpuset constraints.

That is part of the "/sys/devices/system/node/node[0-9]*/hugepages/" 
docuemntation,
not "/proc/sys/vm/nr_hugepages" documentation.

It's in the "Per Node Hugepages Attributes" section.
The document states that compatibility has been maintained between
/sys/devices/system/node/node[0-9]*/hugepages/nr_hugepages and
/proc/sys/vm/nr_hugepages. But, /proc/sys/vm/nr_hugepages displays the 
number of
persistent hugetlb pages,
while /sys/devices/system/node/node[0-9]*/hugepages/nr_hugepages shows total
number hugetlb pages.

For clearness, an example:

Prepare:

	echo 100 > /proc/sys/vm/nr_hugepages
	launch a program to reserve 100 hupatlb pages, then sleep
	echo 0 >  /proc/sys/vm/nr_hugepages

Check the result:

	cat /proc/sys/vm/nr_hugepages

	0

	cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

	100

If the compatibility is maintained, they should return the same value,
0 or 100.

Also, check the /proc/meminfo:

	grep "HugePages_" /proc/meminfo

		HugePages_Total:     100
		HugePages_Free:      100
		HugePages_Rsvd:      100
		HugePages_Surp:      100

The documentation says:

The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool.	

As you see, HugePages_Total should the total number of hugetlb pages
instead of "total number of persistent hugetlb pages". This is one the
mistakes the documentation made.

If I am not wrong, that documentation -- including the usage of the 
"persistent"
term -- were introduced in 2009 already:

commit 267b4c281b4a43c8f3d965c791d3a7fd62448733
Author: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
Date:   Mon Dec 14 17:58:30 2009 -0800

     hugetlb: update hugetlb documentation for NUMA controls
     Update the kernel huge tlb documentation to describe the numa memory
     policy based huge page management.  Additionaly, the patch includes 
a fair
     amount of rework to improve consistency, eliminate duplication and 
set the
     context for documenting the memory policy interaction.
So this has been documented behavior for a long time.

So, I create the patch 4 to make the documentation more clear.

If such subtle inconsistencies result in unexpected behavior, it can be
challenging for a system administrator to detect.

Your patch is changing the traditional, documented behavior to then change
the documentation to match the new implementation.
I'm not intended to modify a correct api and then change the documentation,
that's totally time-wasting. The API can be easily misused and the
documentation is unclear, that's why I send the patchset.

How can you be sure nobody relies on exactly that traditional, well 
documented
behavior?
Considering no special statement about the inconsistency, I think more
people will just misuse it instead of relying on the minor
inconsistency.

Why change the behavior of an interface that is kept for backwards 
compatibility,
that might still be in use under the assumptions that the behavior is 
exactly
like that (and for at least the last 14 years)?

Maybe, you mean "to keep backwards compatibility" ?

Thanks,
Hu