On Thu, 2009-09-03 at 13:07 -0700, David Rientjes wrote:
> On Fri, 28 Aug 2009, Lee Schermerhorn wrote:
> 
> > [PATCH 6/6] hugetlb: update hugetlb documentation for mempolicy based management.
> > 
> > Against: 2.6.31-rc7-mmotm-090827-0057
> > 
> > V2: Add brief description of per node attributes.
> > 
> > This patch updates the kernel huge tlb documentation to describe the
> > numa memory policy based huge page management. Additionaly, the patch
> > includes a fair amount of rework to improve consistency, eliminate
> > duplication and set the context for documenting the memory policy
> > interaction.
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
> 
> Adding Randy to the cc. Comments below, but otherwise:
> 
> Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
> 
> > 
> >  Documentation/vm/hugetlbpage.txt |  257 ++++++++++++++++++++++++++-------------
> >  1 file changed, 172 insertions(+), 85 deletions(-)
> > 
> > Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
> > ===================================================================
> > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:16.000000000 -0400
> > +++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:32.000000000 -0400
> > @@ -11,23 +11,21 @@ This optimization is more critical now a
> >  (several GBs) are more readily available.
> > 
> >  Users can use the huge page support in Linux kernel by either using the mmap
> > -system call or standard SYSv shared memory system calls (shmget, shmat).
> > +system call or standard SYSV shared memory system calls (shmget, shmat).
> > 
> >  First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
> >  (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
> >  automatically when CONFIG_HUGETLBFS is selected) configuration
> >  options.
> > 
> > -The kernel built with huge page support should show the number of configured
> > -huge pages in the system by running the "cat /proc/meminfo" command.
> > +The /proc/meminfo file provides information about the total number of hugetlb
> > +pages preallocated in the kernel's huge page pool. It also displays
> > +information about the number of free, reserved and surplus huge pages and the
> > +[default] huge page size. The huge page size is needed for generating the
> 
> Don't think the brackets are needed.

will fix

> > +proper alignment and size of the arguments to system calls that map huge page
> > +regions.
> > 
> > -/proc/meminfo also provides information about the total number of hugetlb
> > -pages configured in the kernel. It also displays information about the
> > -number of free hugetlb pages at any time. It also displays information about
> > -the configured huge page size - this is needed for generating the proper
> > -alignment and size of the arguments to the above system calls.
> > -
> > -The output of "cat /proc/meminfo" will have lines like:
> > +The output of "cat /proc/meminfo" will include lines like:
> > 
> >  .....
> >  HugePages_Total: vvv
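
As an aside, on a hypothetical system with twenty free 2MB huge pages and
nothing reserved or overcommitted, those lines would read something like
(the values below are purely illustrative):

	HugePages_Total:      20
	HugePages_Free:       20
	HugePages_Rsvd:        0
	HugePages_Surp:        0
	Hugepagesize:       2048 kB
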
> > @@ -53,26 +51,25 @@ HugePages_Surp is short for "surplus,"
> >  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
> >  in the kernel.
> > 
> > -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> > -pages in the kernel. Super user can dynamically request more (or free some
> > -pre-configured) huge pages.
> > -The allocation (or deallocation) of hugetlb pages is possible only if there are
> > -enough physically contiguous free pages in system (freeing of huge pages is
> > -possible only if there are enough hugetlb pages free that can be transferred
> > -back to regular memory pool).
> > -
> > -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> > -be used for other purposes.
> > -
> > -Once the kernel with Hugetlb page support is built and running, a user can
> > -use either the mmap system call or shared memory system calls to start using
> > -the huge pages. It is required that the system administrator preallocate
> > -enough memory for huge page purposes.
> > -
> > -The administrator can preallocate huge pages on the kernel boot command line by
> > -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> > -requested. This is the most reliable method for preallocating huge pages as
> > -memory has not yet become fragmented.
> > +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> > +allocated in the kernel's huge page pool. These are called "persistent"
> > +huge pages. A user with root privileges can dynamically allocate more or
> > +free some persistent huge pages by increasing or decreasing the value of
> > +'nr_hugepages'.
> > +
> 
> So they're not necessarily "preallocated" then if they're already in use.

I don't see what in the text you're referring to: "preallocated" vs
"already in use"???

> 
> > +Pages that are used as huge pages are reserved inside the kernel and cannot
> > +be used for other purposes. Huge pages can not be swapped out under
> > +memory pressure.
> > +
> > +Once a number of huge pages have been pre-allocated to the kernel huge page
> > +pool, a user with appropriate privilege can use either the mmap system call
> > +or shared memory system calls to use the huge pages. See the discussion of
> > +Using Huge Pages, below
> > +
> > +The administrator can preallocate persistent huge pages on the kernel boot
> > +command line by specifying the "hugepages=N" parameter, where 'N' = the
> > +number of requested huge pages requested. This is the most reliable method
> > +or preallocating huge pages as memory has not yet become fragmented.
> > 
> >  Some platforms support multiple huge page sizes. To preallocate huge pages
> >  of a specific size, one must preceed the huge pages boot command parameters
> > @@ -80,19 +77,24 @@ with a huge page size selection paramete
> >  be specified in bytes with optional scale suffix [kKmMgG]. The default huge
> >  page size may be selected with the "default_hugepagesz=<size>" boot parameter.
> > 
> > -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> > -size] hugetlb pages in the kernel. Super user can dynamically request more
> > -(or free some pre-configured) huge pages.
> > -
> > -Use the following command to dynamically allocate/deallocate default sized
> > -huge pages:
> > +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> > +indicates the current number of pre-allocated huge pages of the default size.
> > +Thus, one can use the following command to dynamically allocate/deallocate
> > +default sized persistent huge pages:
> > 
> >  	echo 20 > /proc/sys/vm/nr_hugepages
> > 
> > -This command will try to configure 20 default sized huge pages in the system.
> > +This command will try to adjust the number of default sized huge pages in the
> > +huge page pool to 20, allocating or freeing huge pages, as required.
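
For illustration, on a platform that supports both 2MB and 1GB huge pages
(the sizes and counts here are hypothetical), boot-time preallocation of the
two pools might look like:

	default_hugepagesz=2M hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=4

The default sized (2MB) pool can then still be resized at run time with the
"echo 20 > /proc/sys/vm/nr_hugepages" command just shown.
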
> > +
> >  On a NUMA platform, the kernel will attempt to distribute the huge page pool
> > -over the all on-line nodes. These huge pages, allocated when nr_hugepages
> > -is increased, are called "persistent huge pages".
> > +over the all the nodes specified by the NUMA memory policy of the task that
> 
> Remove the first 'the'.

OK.

> 
> > +modifies nr_hugepages that contain sufficient available contiguous memory.
> > +These nodes are called the huge pages "allowed nodes". The default for the
> 
> Not sure if you need to spell out that they're called "huge page allowed
> nodes," isn't that an implementation detail? The way Paul Jackson used to
> describe nodes_allowed is "set of allowable nodes," and I can't think of a
> better phrase. That's also how the cpuset documentation describes them.

I wanted to refer to "huge pages allowed nodes" to differentiate from, e.g.,
cpusets' mems_allowed--i.e., I wanted the "huge pages" qualifier. I suppose
I could introduce the phrase you suggest: "set of allowable nodes" and
emphasize that in this doc, it only refers to nodes from which persistent
huge pages will be allocated.

> 
> > +huge pages allowed nodes--when the task has default memory policy--is all
> > +on-line nodes. See the discussion below of the interaction of task memory
> 
> All online nodes with memory, right?

See response to comment on patch 5/6. We can only allocate huge pages from
nodes that have them available, but the current code [before these patches]
does visit all on-line nodes. As I mentioned, changing this could have
hotplug {imp|comp}lications, and for this patch set, I don't want to go
there.

> 
> > +policy, cpusets and per node attributes with the allocation and freeing of
> > +persistent huge pages.
> > 
> >  The success or failure of huge page allocation depends on the amount of
> >  physically contiguous memory that is preset in system at the time of the
> > @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
> >  allocating extra pages on other nodes with sufficient available contiguous
> >  memory, if any.
> > 
> > -System administrators may want to put this command in one of the local rc init
> > -files. This will enable the kernel to request huge pages early in the boot
> > -process when the possibility of getting physical contiguous pages is still
> > -very high. Administrators can verify the number of huge pages actually
> > -allocated by checking the sysctl or meminfo. To check the per node
> > +System administrators may want to put this command in one of the local rc
> > +init files. This will enable the kernel to preallocate huge pages early in
> > +the boot process when the possibility of getting physical contiguous pages
> > +is still very high. Administrators can verify the number of huge pages
> > +actually allocated by checking the sysctl or meminfo. To check the per node
> >  distribution of huge pages in a NUMA system, use:
> > 
> >  	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
> > 
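
On a hypothetical two node system with the twenty default sized huge pages
from the earlier example spread evenly, that command would produce output
along these lines (the values are only illustrative):

	Node 0 HugePages_Total:    10
	Node 0 HugePages_Free:     10
	Node 1 HugePages_Total:    10
	Node 1 HugePages_Free:     10
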
> > @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
> >  /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
> >  huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
> >  requested by applications. Writing any non-zero value into this file
> > -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> > -huge pages from the buddy allocator, when the normal pool is exhausted. As
> > -these surplus huge pages go out of use, they are freed back to the buddy
> > -allocator.
> > +indicates that the hugetlb subsystem is allowed to try to obtain that
> > +number of "surplus" huge pages from the kernel's normal page pool, when the
> > +persistent huge page pool is exhausted. As these surplus huge pages become
> > +unused, they are freed back to the kernel's normal page pool.
> > 
> > -When increasing the huge page pool size via nr_hugepages, any surplus
> > +When increasing the huge page pool size via nr_hugepages, any existing surplus
> >  pages will first be promoted to persistent huge pages. Then, additional
> >  huge pages will be allocated, if necessary and if possible, to fulfill
> > -the new huge page pool size.
> > +the new persistent huge page pool size.
> > 
> >  The administrator may shrink the pool of preallocated huge pages for
> >  the default huge page size by setting the nr_hugepages sysctl to a
> >  smaller value. The kernel will attempt to balance the freeing of huge pages
> > -across all on-line nodes. Any free huge pages on the selected nodes will
> > -be freed back to the buddy allocator.
> > -
> > -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> > -than the number of huge pages in use will convert the balance to surplus
> > -huge pages even if it would exceed the overcommit value. As long as
> > -this condition holds, however, no more surplus huge pages will be
> > -allowed on the system until one of the two sysctls are increased
> > -sufficiently, or the surplus huge pages go out of use and are freed.
> > +across all nodes in the memory policy of the task modifying nr_hugepages.
> > +Any free huge pages on the selected nodes will be freed back to the kernel's
> > +normal page pool.
> > +
> > +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> > +it becomes less than the number of huge pages in use will convert the balance
> > +of the in-use huge pages to surplus huge pages. This will occur even if
> > +the number of surplus pages it would exceed the overcommit value. As long as
> > +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> > +increased sufficiently, or the surplus huge pages go out of use and are freed--
> > +no more surplus huge pages will be allowed to be allocated.
> > 
> 
> Nice description!
> 
> >  With support for multiple huge page pools at run-time available, much of
> > -the huge page userspace interface has been duplicated in sysfs. The above
> > -information applies to the default huge page size which will be
> > -controlled by the /proc interfaces for backwards compatibility. The root
> > -huge page control directory in sysfs is:
> > +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> > +The /proc interfaces discussed above have been retained for backwards
> > +compatibility. The root huge page control directory in sysfs is:
> > 
> >  	/sys/kernel/mm/hugepages
> > 
> >  For each huge page size supported by the running kernel, a subdirectory
> > -will exist, of the form
> > +will exist, of the form:
> > 
> >  	hugepages-${size}kB
> > 
> > @@ -159,6 +162,98 @@ Inside each of these directories, the sa
> > 
> >  which function as described above for the default huge page-sized case.
> > 
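
As a concrete illustration (assuming a 2MB default huge page size, so the
per-size subdirectory is named hugepages-2048kB), the pool resize shown
earlier could equivalently be done through sysfs with:

	echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
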
> > +
> > +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> > +
> > +Whether huge pages are allocated and freed via the /proc interface or
> > +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> > +or freed are controlled by the NUMA memory policy of the task that modifies
> > +the nr_hugepages parameter. [nr_overcommit_hugepages is a global limit.]
> > +
> > +The recommended method to allocate or free huge pages to/from the kernel
> > +huge page pool, using the nr_hugepages example above, is:
> > +
> > +	numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
> > +
> > +or, more succinctly:
> > +
> > +	numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
> > +
> > +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> > +specified in <node-list>, depending on whether nr_hugepages is initially
> > +less than or greater than 20, respectively. No huge pages will be
> > +allocated nor freed on any node not included in the specified <node-list>.
> > +
> 
> This is actually why I was against the mempolicy approach to begin with:
> applications currently can free all hugepages on the system simply by
> writing to nr_hugepages, regardless of their mempolicy. It's now possible
> that hugepages will remain allocated because they are on nodes disjoint
> from current->mempolicy->v.nodes. I hope the advantages of this approach
> outweigh the potential userspace breakage of existing applications.

I understand. However, I do think it's useful to support both a mask [and
Mel prefers it be based on mempolicy] and per node attributes. On some of
our platforms, we do want explicit control over the placement of huge
pages--e.g., for a database shared area or such. So, we can say, "I need <N>
huge pages, and I want them on nodes 1, 3, 4 and 5", and then, assuming we
start with no huge pages allocated [free them all if this is not the case]:

	numactl -m 1,3-5 hugeadm --pool-pages-min 2M:<N>

Later, if I decide that maybe I want to adjust the number on node 1, I can:

	numactl -m 1 hugeadm --pool-pages-min 2M:{+|-}<count>

or:

	echo <new-value> >/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

[Of course, I'd probably do this in a script to avoid all that typing :)]
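
Something along these lines, for example (an untested sketch; the node list,
page count and 2M page size are placeholders, and it assumes numactl and
libhugetlbfs' hugeadm are installed):

	#!/bin/sh
	# Allocate COUNT persistent default-sized (2M) huge pages,
	# interleaved over NODES only; values are placeholders.
	NODES=1,3-5
	COUNT=2000

	# Apply an interleave policy restricted to NODES, then grow the pool:
	numactl --interleave=${NODES} hugeadm --pool-pages-min 2M:${COUNT}

	# Roughly equivalent, without hugeadm, for the default huge page size:
	# numactl --interleave=${NODES} echo ${COUNT} > /proc/sys/vm/nr_hugepages
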
> > +Any memory policy mode--bind, preferred, local or interleave--may be
> > +used. The effect on persistent huge page allocation will be as follows:
> > +
> > +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> > +   persistent huge pages will be distributed across the node or nodes
> > +   specified in the mempolicy as if "interleave" had been specified.
> > +   However, if a node in the policy does not contain sufficient contiguous
> > +   memory for a huge page, the allocation will not "fallback" to the nearest
> > +   neighbor node with sufficient contiguous memory. To do this would cause
> > +   undesirable imbalance in the distribution of the huge page pool, or
> > +   possibly, allocation of persistent huge pages on nodes not allowed by
> > +   the task's memory policy.
> > +
> 
> This is a good example of why the per-node tunables are helpful in case
> such a fallback is desired.

Agreed. And the fact that they do bypass any mempolicy.

> 
> > +2) One or more nodes may be specified with the bind or interleave policy.
> > +   If more than one node is specified with the preferred policy, only the
> > +   lowest numeric id will be used. Local policy will select the node where
> > +   the task is running at the time the nodes_allowed mask is constructed.
> > +
> > +3) For local policy to be deterministic, the task must be bound to a cpu or
> > +   cpus in a single node. Otherwise, the task could be migrated to some
> > +   other node at any time after launch and the resulting node will be
> > +   indeterminate. Thus, local policy is not very useful for this purpose.
> > +   Any of the other mempolicy modes may be used to specify a single node.
> > +
> > +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> > +   whether this policy was set explicitly by the task itself or one of its
> > +   ancestors, such as numactl. This means that if the task is invoked from a
> > +   shell with non-default policy, that policy will be used. One can specify a
> > +   node list of "all" with numactl --interleave or --membind [-m] to achieve
> > +   interleaving over all nodes in the system or cpuset.
> > +
> 
> Nice description.
> 
> > +5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
> > +   the resource limits of any cpuset in which the task runs. Thus, there will
> > +   be no way for a task with non-default policy running in a cpuset with a
> > +   subset of the system nodes to allocate huge pages outside the cpuset
> > +   without first moving to a cpuset that contains all of the desired nodes.
> > +
> > +6) Hugepages allocated at boot time always use the node_online_map.
> 
> Implementation detail in the name, maybe just say "all online nodes with
> memory"?

OK. will fix for V6. soon come, I hope.

> 
> > +
> > +
> > +Per Node Hugepages Attributes
> > +
> > +A subset of the contents of the root huge page control directory in sysfs,
> > +described above, has been replicated under each "node" system device in:
> > +
> > +	/sys/devices/system/node/node[0-9]*/hugepages/
> > +
> > +Under this directory, the subdirectory for each supported huge page size
> > +contains the following attribute files:
> > +
> > +	nr_hugepages
> > +	free_hugepages
> > +	surplus_hugepages
> > +
> > +The free_' and surplus_' attribute files are read-only. They return the number
> > +of free and surplus [overcommitted] huge pages, respectively, on the parent
> > +node.
> > +
> > +The nr_hugepages attribute will return the total number of huge pages on the
> > +specified node. When this attribute is written, the number of persistent huge
> > +pages on the parent node will be adjusted to the specified value, if sufficient
> > +resources exist, regardless of the task's mempolicy or cpuset constraints.
> > +
> > +Note that the number of overcommit and reserve pages remain global quantities,
> > +as we don't know until fault time, when the faulting task's mempolicy is applied,
> > +from which node the huge page allocation will be attempted.
> > +
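
For example, with these per node attributes in place, node 1's 2MB pool can
be queried and set directly (the node number and count here are only
illustrative):

	cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
	echo 64 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

As the text above notes, unlike the nr_hugepages sysctl this adjusts node 1
regardless of the writing task's mempolicy or cpuset.
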
> > +
> > +Using Huge Pages:
> > +
> >  If the user applications are going to request huge pages using mmap system
> >  call, then it is required that system administrator mount a file system of
> >  type hugetlbfs:
> > @@ -206,9 +301,11 @@ map_hugetlb.c.
> >   * requesting huge pages.
> >   *
> >   * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> > - * huge pages. That means the addresses starting with 0x800000... will need
> > - * to be specified. Specifying a fixed address is not required on ppc64,
> > - * i386 or x86_64.
> > + * huge pages. That means that if one requires a fixed address, a huge page
> > + * aligned address starting with 0x800000... will be required. If a fixed
> > + * address is not required, the kernel will select an address in the proper
> > + * range.
> > + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
> >   *
> >   * Note: The default shared memory limit is quite low on many kernels,
> >   * you may need to increase it via:
> > @@ -237,14 +334,8 @@ map_hugetlb.c.
> > 
> >  #define dprintf(x)  printf(x)
> > 
> > -/* Only ia64 requires this */
> > -#ifdef __ia64__
> > -#define ADDR (void *)(0x8000000000000000UL)
> > -#define SHMAT_FLAGS (SHM_RND)
> > -#else
> > -#define ADDR (void *)(0x0UL)
> > +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
> >  #define SHMAT_FLAGS (0)
> > -#endif
> > 
> >  int main(void)
> >  {
> > @@ -302,10 +393,12 @@ int main(void)
> >   * example, the app is requesting memory of size 256MB that is backed by
> >   * huge pages.
> >   *
> > - * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
> > - * That means the addresses starting with 0x800000... will need to be
> > - * specified. Specifying a fixed address is not required on ppc64, i386
> > - * or x86_64.
> > + * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> > + * huge pages. That means that if one requires a fixed address, a huge page
> > + * aligned address starting with 0x800000... will be required. If a fixed
> > + * address is not required, the kernel will select an address in the proper
> > + * range.
> > + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
> >   */
> >  #include <stdlib.h>
> >  #include <stdio.h>
> > @@ -317,14 +410,8 @@ int main(void)
> >  #define LENGTH (256UL*1024*1024)
> >  #define PROTECTION (PROT_READ | PROT_WRITE)
> > 
> > -/* Only ia64 requires this */
> > -#ifdef __ia64__
> > -#define ADDR (void *)(0x8000000000000000UL)
> > -#define FLAGS (MAP_SHARED | MAP_FIXED)
> > -#else
> > -#define ADDR (void *)(0x0UL)
> > +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
> >  #define FLAGS (MAP_SHARED)
> > -#endif
> > 
> >  void check_bytes(char *addr)
> >  {
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-numa" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html