On Fri, 28 Aug 2009 12:03:51 -0400 Lee Schermerhorn wrote:

(Thanks for cc:, David.)

> [PATCH 6/6] hugetlb: update hugetlb documentation for mempolicy based management.
>
> Against: 2.6.31-rc7-mmotm-090827-0057
>
> V2: Add brief description of per node attributes.
>
> This patch updates the kernel huge tlb documentation to describe the
> numa memory policy based huge page management. Additionaly, the patch
> includes a fair amount of rework to improve consistency, eliminate
> duplication and set the context for documenting the memory policy
> interaction.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
>
> Documentation/vm/hugetlbpage.txt | 257 ++++++++++++++++++++++++++-------------
> 1 file changed, 172 insertions(+), 85 deletions(-)
>
> Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt 2009-08-28 09:21:16.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt 2009-08-28 09:21:32.000000000 -0400
> @@ -53,26 +51,25 @@ HugePages_Surp is short for "surplus,"
> /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
> in the kernel.
>
> -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> -pages in the kernel. Super user can dynamically request more (or free some
> -pre-configured) huge pages.
> -The allocation (or deallocation) of hugetlb pages is possible only if there are
> -enough physically contiguous free pages in system (freeing of huge pages is
> -possible only if there are enough hugetlb pages free that can be transferred
> -back to regular memory pool).
> -
> -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> -be used for other purposes.
> -
> -Once the kernel with Hugetlb page support is built and running, a user can
> -use either the mmap system call or shared memory system calls to start using
> -the huge pages. It is required that the system administrator preallocate
> -enough memory for huge page purposes.
> -
> -The administrator can preallocate huge pages on the kernel boot command line by
> -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> -requested. This is the most reliable method for preallocating huge pages as
> -memory has not yet become fragmented.
> +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> +allocated in the kernel's huge page pool. These are called "persistent"
> +huge pages. A user with root privileges can dynamically allocate more or
> +free some persistent huge pages by increasing or decreasing the value of
> +'nr_hugepages'.
> +
> +Pages that are used as huge pages are reserved inside the kernel and cannot
> +be used for other purposes. Huge pages can not be swapped out under

cannot

> +memory pressure.
> +
> +Once a number of huge pages have been pre-allocated to the kernel huge page
> +pool, a user with appropriate privilege can use either the mmap system call
> +or shared memory system calls to use the huge pages. See the discussion of
> +Using Huge Pages, below

below.

> +
> +The administrator can preallocate persistent huge pages on the kernel boot
> +command line by specifying the "hugepages=N" parameter, where 'N' = the
> +number of requested huge pages requested. This is the most reliable method

drop first "requested"

> +or preallocating huge pages as memory has not yet become fragmented.

of

>
> Some platforms support multiple huge page sizes. To preallocate huge pages
> of a specific size, one must preceed the huge pages boot command parameters
> @@ -80,19 +77,24 @@ with a huge page size selection paramete
> be specified in bytes with optional scale suffix [kKmMgG]. The default huge
> page size may be selected with the "default_hugepagesz=<size>" boot parameter.
>
> -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> -size] hugetlb pages in the kernel. Super user can dynamically request more
> -(or free some pre-configured) huge pages.
> -
> -Use the following command to dynamically allocate/deallocate default sized
> -huge pages:
> +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> +indicates the current number of pre-allocated huge pages of the default size.
> +Thus, one can use the following command to dynamically allocate/deallocate
> +default sized persistent huge pages:
>
>         echo 20 > /proc/sys/vm/nr_hugepages
>
> -This command will try to configure 20 default sized huge pages in the system.
> +This command will try to adjust the number of default sized huge pages in the
> +huge page pool to 20, allocating or freeing huge pages, as required.
> +
> On a NUMA platform, the kernel will attempt to distribute the huge page pool
> -over the all on-line nodes. These huge pages, allocated when nr_hugepages
> -is increased, are called "persistent huge pages".
> +over the all the nodes specified by the NUMA memory policy of the task that

drop first "the"

> +modifies nr_hugepages that contain sufficient available contiguous memory.

whoa. too many "that"s. confusing.

> +These nodes are called the huge pages "allowed nodes". The default for the
> +huge pages allowed nodes--when the task has default memory policy--is all
> +on-line nodes. See the discussion below of the interaction of task memory
> +policy, cpusets and per node attributes with the allocation and freeing of
> +persistent huge pages.
>
> The success or failure of huge page allocation depends on the amount of
> physically contiguous memory that is preset in system at the time of the
> @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att

...

> @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
> /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
> huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
> requested by applications. Writing any non-zero value into this file
> -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> -huge pages from the buddy allocator, when the normal pool is exhausted. As
> -these surplus huge pages go out of use, they are freed back to the buddy
> -allocator.
> +indicates that the hugetlb subsystem is allowed to try to obtain that
> +number of "surplus" huge pages from the kernel's normal page pool, when the
> +persistent huge page pool is exhausted. As these surplus huge pages become
> +unused, they are freed back to the kernel's normal page pool.
>
> -When increasing the huge page pool size via nr_hugepages, any surplus
> +When increasing the huge page pool size via nr_hugepages, any existing surplus
> pages will first be promoted to persistent huge pages. Then, additional
> huge pages will be allocated, if necessary and if possible, to fulfill
> -the new huge page pool size.
> +the new persistent huge page pool size.
>
> The administrator may shrink the pool of preallocated huge pages for
> the default huge page size by setting the nr_hugepages sysctl to a
> smaller value. The kernel will attempt to balance the freeing of huge pages
> -across all on-line nodes. Any free huge pages on the selected nodes will
> -be freed back to the buddy allocator.
> -
> -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> -than the number of huge pages in use will convert the balance to surplus
> -huge pages even if it would exceed the overcommit value. As long as
> -this condition holds, however, no more surplus huge pages will be
> -allowed on the system until one of the two sysctls are increased
> -sufficiently, or the surplus huge pages go out of use and are freed.
> +across all nodes in the memory policy of the task modifying nr_hugepages.
> +Any free huge pages on the selected nodes will be freed back to the kernel's
> +normal page pool.
> +
> +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> +it becomes less than the number of huge pages in use will convert the balance
> +of the in-use huge pages to surplus huge pages. This will occur even if

surplus allocated huge pages ?
vs. surplus available huge pages?
surplus (to me) implies available/unallocated...

Reading more below, I see that "surplus" here means "overcommitted". oh well ;)

> +the number of surplus pages it would exceed the overcommit value. As long as
> +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> +increased sufficiently, or the surplus huge pages go out of use and are freed--
> +no more surplus huge pages will be allowed to be allocated.
>
> With support for multiple huge page pools at run-time available, much of
> -the huge page userspace interface has been duplicated in sysfs. The above
> -information applies to the default huge page size which will be
> -controlled by the /proc interfaces for backwards compatibility. The root
> -huge page control directory in sysfs is:
> +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> +The /proc interfaces discussed above have been retained for backwards
> +compatibility. The root huge page control directory in sysfs is:
>
>         /sys/kernel/mm/hugepages
>
> For each huge page size supported by the running kernel, a subdirectory
> -will exist, of the form
> +will exist, of the form:
>
>         hugepages-${size}kB
>
> @@ -159,6 +162,98 @@ Inside each of these directories, the sa
>
> which function as described above for the default huge page-sized case.
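
For readers following along, the sysfs layout described above can be walked
with a few lines of shell. This is only a sketch and is not part of the patch.
It assumes each hugepages-${size}kB directory carries nr_hugepages,
free_hugepages and surplus_hugepages files; the function name
list_hugepage_pools and its optional root argument are hypothetical, added so
the sketch can be pointed at a scratch directory instead of the real sysfs.

```shell
#!/bin/sh
# list_hugepage_pools [ROOT]
# Print one summary line per huge page size found under ROOT.
# ROOT defaults to the sysfs directory named in the text; passing a
# different root is an assumption made here purely for experimentation.
list_hugepage_pools() {
    root=${1:-/sys/kernel/mm/hugepages}
    for dir in "$root"/hugepages-*kB; do
        # If the glob matched nothing, $dir is the literal pattern: skip it.
        [ -d "$dir" ] || continue
        size=${dir##*/hugepages-}        # e.g. "2048kB"
        nr=$(cat "$dir/nr_hugepages")
        free=$(cat "$dir/free_hugepages")
        surplus=$(cat "$dir/surplus_hugepages")
        echo "$size nr=$nr free=$free surplus=$surplus"
    done
}
```

Calling list_hugepage_pools with no argument reads the live pool counts;
reading these files does not change the pool.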
>
> +
> +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> +
> +Whether huge pages are allocated and freed via the /proc interface or
> +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> +or freed are controlled by the NUMA memory policy of the task that modifies
> +the nr_hugepages parameter. [nr_overcommit_hugepages is a global limit.]
> +
> +The recommended method to allocate or free huge pages to/from the kernel
> +huge page pool, using the nr_hugepages example above, is:
> +
> +    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages.

drop '.'

> +
> +or, more succinctly:
> +
> +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages.

ditto

> +
> +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> +specified in <node-list>, depending on whether nr_hugepages is initially
> +less than or greater than 20, respectively. No huge pages will be
> +allocated nor freed on any node not included in the specified <node-list>.
> +
> +Any memory policy mode--bind, preferred, local or interleave--may be
> +used. The effect on persistent huge page allocation will be as follows:

I would just use present tense as much as possible, e.g.,
allocation is as follows:

> +
> +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> +   persistent huge pages will be distributed across the node or nodes
> +   specified in the mempolicy as if "interleave" had been specified.
> +   However, if a node in the policy does not contain sufficient contiguous
> +   memory for a huge page, the allocation will not "fallback" to the nearest
> +   neighbor node with sufficient contiguous memory. To do this would cause
> +   undesirable imbalance in the distribution of the huge page pool, or
> +   possibly, allocation of persistent huge pages on nodes not allowed by
> +   the task's memory policy.
> +
> +2) One or more nodes may be specified with the bind or interleave policy.
> +   If more than one node is specified with the preferred policy, only the
> +   lowest numeric id will be used. Local policy will select the node where
> +   the task is running at the time the nodes_allowed mask is constructed.
> +
> +3) For local policy to be deterministic, the task must be bound to a cpu or
> +   cpus in a single node. Otherwise, the task could be migrated to some

I prefer s/cpu/CPU/ in all of Documentation/ text, but the cat is already
out of the bag on that.

> +   other node at any time after launch and the resulting node will be
> +   indeterminate. Thus, local policy is not very useful for this purpose.
> +   Any of the other mempolicy modes may be used to specify a single node.
> +
> +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +   whether this policy was set explicitly by the task itself or one of its
> +   ancestors, such as numactl. This means that if the task is invoked from a
> +   shell with non-default policy, that policy will be used. One can specify a
> +   node list of "all" with numactl --interleave or --membind [-m] to achieve
> +   interleaving over all nodes in the system or cpuset.
> +
> +5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
> +   the resource limits of any cpuset in which the task runs. Thus, there will
> +   be no way for a task with non-default policy running in a cpuset with a
> +   subset of the system nodes to allocate huge pages outside the cpuset
> +   without first moving to a cpuset that contains all of the desired nodes.
> +
> +6) Hugepages allocated at boot time always use the node_online_map.
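
As an illustration of item 1 only (not part of the patch), the "as if
interleave" distribution can be sketched as simple round-robin arithmetic.
The sketch assumes the pool starts empty, that every allowed node has enough
contiguous memory, and that any remainder goes to the nodes listed first; the
node on which the kernel actually starts its round-robin may differ. The
function name interleave_share is hypothetical.

```shell
#!/bin/sh
# interleave_share COUNT NODE...
# Print the number of huge pages each listed node would receive if COUNT
# pages were spread round-robin over the nodes, starting with the first.
# Assumption: no node runs out of contiguous memory, so nothing is skipped.
interleave_share() {
    if [ "$#" -lt 2 ]; then
        echo "usage: interleave_share COUNT NODE..." >&2
        return 1
    fi
    count=$1
    shift
    nodes=$#
    base=$((count / nodes))      # pages every node receives
    extra=$((count % nodes))     # leftover pages, one each to the first nodes
    i=0
    for node in "$@"; do
        share=$base
        if [ "$i" -lt "$extra" ]; then
            share=$((base + 1))
        fi
        echo "node$node: $share"
        i=$((i + 1))
    done
}
```

For example, "interleave_share 20 0 1 2" prints 7, 7 and 6 pages for nodes 0,
1 and 2. Per item 1, a node without sufficient contiguous memory does not
cause fallback to a neighbor, so real totals can come up short of the request.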
> +
> +
> +Per Node Hugepages Attributes
> +
> +A subset of the contents of the root huge page control directory in sysfs,
> +described above, has been replicated under each "node" system device in:
> +
> +        /sys/devices/system/node/node[0-9]*/hugepages/
> +
> +Under this directory, the subdirectory for each supported huge page size
> +contains the following attribute files:
> +
> +        nr_hugepages
> +        free_hugepages
> +        surplus_hugepages
> +
> +The free_' and surplus_' attribute files are read-only. They return the number
> +of free and surplus [overcommitted] huge pages, respectively, on the parent
> +node.
> +
> +The nr_hugepages attribute will return the total number of huge pages on the
> +specified node. When this attribute is written, the number of persistent huge
> +pages on the parent node will be adjusted to the specified value, if sufficient
> +resources exist, regardless of the task's mempolicy or cpuset constraints.
> +
> +Note that the number of overcommit and reserve pages remain global quantities,
> +as we don't know until fault time, when the faulting task's mempolicy is applied,
> +from which node the huge page allocation will be attempted.
> +
> +
> +Using Huge Pages:
> +
> If the user applications are going to request huge pages using mmap system
> call, then it is required that system administrator mount a file system of
> type hugetlbfs:
> @@ -206,9 +301,11 @@ map_hugetlb.c.

...

> @@ -237,14 +334,8 @@ map_hugetlb.c.

...

> @@ -302,10 +393,12 @@ int main(void)

...

> @@ -317,14 +410,8 @@ int main(void)

...

---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/