PATCH 1/10 libnuma man pages -- General Cleanup

Against: numactl 1.0.3-rc1

1) globally remove trailing whitespace. Quilt complains...

2) clarify scope of memory policies: per thread, on an address space
   range--shared by all threads in a process and on a range of a shared
   memory segment--shared by all processes that attach to the segment.
   Moved this up from bottom of page and reworked.

3) Establish initial, default memory policy as "local".

4) Clarify [IMO] several descriptions.

5) Correct erroneous text for numa_set_interleave_mask(),
   numa_set_membind(), ... to match libnuma code.

6) Note that numa_alloc_interleaved() may fail if the set of allowed
   nodes is constrained externally--i.e., by cpusets. But, I don't
   mention cpusets because I have no man page to reference.

7) size and start arguments of the *_alloc_* functions must be page
   aligned. No rounding up in the library or syscalls, except for
   size argument to mbind().

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>

 numa.3 | 307 ++++++++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 191 insertions(+), 116 deletions(-)

Index: numactl-1.0.3-rc1/numa.3
===================================================================
--- numactl-1.0.3-rc1.orig/numa.3 2008-02-10 20:54:17.000000000 -0500
+++ numactl-1.0.3-rc1/numa.3 2008-04-01 21:04:26.000000000 -0400
@@ -8,12 +8,12 @@ .\" manual under the conditions for verbatim copying, provided that the .\" entire resulting derived work is distributed under the terms of a .\" permission notice identical to this one. -.\" +.\" .\" Since the Linux kernel and libraries are constantly changing, this .\" manual page may be incorrect or out-of-date. The author(s) assume no .\" responsibility for errors or omissions, or for damages resulting from -.\" the use of the information contained herein. -.\" +.\" the use of the information contained herein. 
+.\" .\" Formatted or processed versions of this manual, if unaccompanied by .\" the source, must acknowledge the copyright and authors of this work. .TH NUMA 3 "December 2007" "SuSE Labs" "Linux Programmer's Manual" @@ -149,48 +149,70 @@ numa \- NUMA policy library .SH DESCRIPTION The -.I libnuma -library offers a simple programming interface to the +.I libnuma +library offers a simple programming interface to the NUMA (Non Uniform Memory Access) -policy supported by the +policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others. -Available policies are -page interleaving (i.e., allocate in a round-robin fashion from all, -or a subset, of the nodes on the system), -preferred node allocation (i.e., preferably allocate on a particular node), -local allocation (i.e., allocate on the node on which +Available policies are +page interleaving (i.e., allocate in a round-robin fashion from all, +or a subset, of the nodes on the system), +preferred node allocation (i.e., preferably allocate on a particular node), +local allocation (i.e., allocate on the node on which the thread is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). -It is also possible to bind threads to specific nodes. +It is also possible to bind threads to specific nodes. + +Numa memory allocation policy may be specified as a per-thread attribute, +that is inherited by child threads and processes, or as an attribute +of a range of process virtual address space. +Numa memory policies specified for a range of virtual address space are +shared by all threads in the process. +Furthermore, memory policies specified for a range of shared memory +attached using +.I shmat(2) +or +.I mmap(2) +from shmfs/hugetlbfs are shared by all processes that attach to that region. +Memory policies for shared disk backed file mappings are currently ignored. 
-Numa memory allocation policy is a per-thread attribute, but is -inherited by children. +The default memory allocation policy for threads and all memory ranges +is local allocation. +This assumes that no ancestor has installed a non-default policy. For setting a specific policy globally for all memory allocations in a process and its children it is easiest -to start it with the +to start it with the .BR numactl (8) utility. For more finegrained policy inside an application this library can be used. All numa memory allocation policy only takes effect when a page is actually -faulted into the address space of a process by accessing it. The +faulted into the address space of a process by accessing it. The .B numa_alloc_* functions take care of this automatically. -A -.I node -is defined as an area where all memory has the same speed as seen from -a particular CPU. A node can contain multiple CPUs. Caches are ignored for this definition. - -This library is only concerned about nodes and their memory and does not deal -with individual CPUs inside these nodes -(except for -.I numa_node_to_cpus -) +A +.I node +is defined as an area where all memory has the same speed as seen from +a particular CPU. +A node can contain multiple CPUs. +Caches are ignored for this definition. + +Most functions in this library are only concerned with numa nodes and +their memory. +The exceptions to this are: +.IR numa_node_to_cpus (), +.IR numa_bind (), +.IR numa_run_on_node (), +.IR numa_run_on_node_mask () +and +.IR numa_get_run_node_mask (). +These functions deal with the CPUs associated with numa nodes. +See the descriptions below for more information. Some of these functions accept or return a pointer to struct bitmask. A struct bitmask controls a bit map of arbitrary length containing a bit @@ -342,33 +364,43 @@ is not NULL, it used to return the amoun On error it returns \-1. 
.BR numa_node_size64 () -works the same as +works the same as .BR numa_node_size () -except that it returns values as -.I long long -instead of +except that it returns values as +.I long long +instead of .IR long . This is useful on 32-bit architectures with large nodes. .BR numa_preferred () -returns the preferred node of the current thread. +returns the preferred node of the current thread. This is the node on which the kernel preferably allocates memory, unless some other policy overrides this. +.\" TODO: results are misleading for MPOL_PREFERRED and may +.\" be incorrect for MPOL_BIND when Mel Gorman's twozonelist +.\" patches go in. In the latter case, we'd need to know the +.\" order of the current node's zonelist to return the correct +.\" node. Need to tighten this up with the syscall results. .BR numa_set_preferred () sets the preferred node for the current thread to .IR node . -The preferred node is the node on which memory is -preferably allocated before falling back to other nodes. -The default is to use the node on which the process is currently running -(local policy). Passing a \-1 argument is equivalent to +The system will attempt to allocate memory from the preferred node, +but will fall back to other nodes if no memory is available on the +preferred node. +Passing a +.I node +of \-1 specifies local allocation and is equivalent to +calling .BR numa_set_localalloc (). .BR numa_get_interleave_mask () -returns the current interleave mask. +returns the current interleave mask if the thread's memory allocation policy +is page interleaved. +Otherwise, this function returns an empty mask. .BR numa_set_interleave_mask () -sets the memory interleave mask for the current thread to +sets the memory interleave mask for the current thread to .IR nodemask . All new memory allocations are page interleaved over all nodes in the interleave mask. 
Interleaving @@ -377,20 +409,33 @@ can be turned off again by passing an em The page interleaving only occurs on the actual page fault that puts a new page into the current address space. It is also only a hint: the kernel will fall back to other nodes if no memory is available on the interleave -target. This is a low level -function, it may be more convenient to use the higher level functions like -.BR numa_alloc_interleaved () -or -.BR numa_alloc_interleaved_subset (). +target. +.\" NOTE: the following is not really the case. this function sets the +.\" thread policy for all future allocations, including stack, bss, ... +.\" The functions specified in this sentence actually allocate a new memory +.\" range [via mmap()]. This is quite a different thing. Suggest we drop +.\" this. +.\" This is a low level +.\" function, it may be more convenient to use the higher level functions like +.\" .BR numa_alloc_interleaved () +.\" or +.\" .BR numa_alloc_interleaved_subset (). .BR numa_interleave_memory () interleaves .I size bytes of memory page by page from .I start -on nodes +on nodes specified in .IR nodemask . -This is a lower level function to interleave not yet faulted in but allocated +The +.I size +argument will be rounded up to a multiple of the system page size. +If +.I nodemask +contains nodes that are externally denied to this process, +this call will fail. +This is a lower level function to interleave allocated but not yet faulted in memory. Not yet faulted in means the memory is allocated using .BR mmap (2) or @@ -409,13 +454,14 @@ flag is true then the operation will cau pages in the mapping that do not follow the policy. .BR numa_bind () -binds the current thread and its children to the nodes -specified in +binds the current thread and its children to the nodes +specified in .IR nodemask . They will only run on the CPUs of the specified nodes and only be able to allocate memory from them. 
This function is equivalent to calling .\" FIXME checkme +.\" This is the case. --lts .I numa_run_on_node_mask(nodemask) followed by .IR numa_set_membind(nodemask) . @@ -427,45 +473,60 @@ and the syscall. .BR numa_set_localalloc () -sets a local memory allocation policy for the calling thread. -Memory is preferably allocated on the node on which the thread is -currently running. +sets the memory allocation policy for the calling thread to +local allocation. +In this mode, the preferred node for memory allocation is +effectively the node where the thread is executing at the +time of a page allocation. .BR numa_set_membind () sets the memory allocation mask. -The thread will only allocate memory from the nodes set in +The thread will only allocate memory from the nodes set in .IR nodemask . -Passing an argument of -.I numa_no_nodes -or -.I numa_all_nodes -turns off memory binding to specific nodes. +Passing an empty +.I nodemask +or a +.I nodemask +that contains nodes other than those in the mask returned by +.IR numa_get_mems_allowed () +will result in an error. .BR numa_get_membind () returns the mask of nodes from which memory can currently be allocated. -If the returned mask is equal to -.I numa_no_nodes -or +If the returned mask is equal to .IR numa_all_nodes , -then all nodes are available for memory allocation. +then memory allocation is allowed from all nodes. .BR numa_alloc_onnode () -allocates memory on a specific node. This function is relatively slow -and allocations are rounded up to the system page size. +allocates memory on a specific node. +The +.I size +argument will be rounded up to a multiple of the system page size. +If the specified +.I node +is externally denied to this process, this call will fail. +This function is relatively slow compared to the +.IR malloc (3) +family of functions. The memory must be freed with .BR numa_free (). -On errors NULL is returned. +On errors NULL is returned. 
.BR numa_alloc_local () allocates .I size -bytes of memory on the local node. This function is relatively slow -and allocations are rounded up to the system page size. +bytes of memory on the local node. +The +.I size +argument will be rounded up to a multiple of the system page size. +This function is relatively slow compared to the +.IR malloc (3) +family of functions. The memory must be freed -with +with .BR numa_free (). -On errors NULL is returned. +On errors NULL is returned. .BR numa_alloc_interleaved () allocates @@ -479,32 +540,54 @@ The allocated memory must be freed with On error, NULL is returned. .BR numa_alloc_interleaved_subset () -is like -.BR numa_alloc_interleaved () -except that it also accepts a mask of the nodes to interleave on. +attempts to allocate +.I size +bytes of memory page interleaved on all nodes. +The +.I size +argument will be rounded up to a multiple of the system page size. +The nodes on which a process is allowed to allocate memory may +be constrained externally. +If this is the case, this function may fail. +This function is relatively slow compared to the +.IR malloc (3) +family of functions and should only be used for large areas consisting +of multiple pages. +The interleaving works at page level and will only show an effect when the +area is large. +The allocated memory must be freed with +.BR numa_free (). On error, NULL is returned. .BR numa_alloc () allocates -.I size -bytes of memory with the current NUMA policy. This function is relatively slow -and allocations are rounded up to the system page size. +.I size +bytes of memory with the current NUMA policy. +The +.I size +argument will be rounded up to a multiple of the system page size. +This function is relatively slow compared to the +.IR malloc (3) +family of functions. The memory must be freed -with +with .BR numa_free (). -On errors NULL is returned. +On errors NULL is returned. 
.BR numa_free () -frees +frees .I size -bytes of memory starting at +bytes of memory starting at .IR start , -allocated by the -.B numa_alloc_* +allocated by the +.B numa_alloc_* functions above. +The +.I size +argument will be rounded up to a multiple of the system page size. .BR numa_run_on_node () -runs the current thread and its children +runs the current thread and its children on a specific node. They will not migrate to CPUs of other nodes until the node affinity is reset with a new call to .BR numa_run_on_node_mask (). @@ -515,12 +598,14 @@ On success, 0 is returned; on error \-1 is set to indicate the error. .BR numa_run_on_node_mask () -runs the current thread and its children only on nodes specified in +runs the current thread and its children only on nodes specified in .IR nodemask . They will not migrate to CPUs of other nodes until the node affinity is reset with a new call to -.BR numa_run_on_node_mask (). -Passing +.BR numa_run_on_node_mask () +or +.BR numa_run_on_node (). +Passing .I numa_all_nodes permits the kernel to schedule on all nodes again. On success, 0 is returned; on error \-1 is returned, and @@ -531,22 +616,22 @@ is set to indicate the error. returns the mask of nodes that the current thread is allowed to run on. .BR numa_tonode_memory () -put memory on a specific node. The constraints described for +put memory on a specific node. The constraints described for .BR numa_interleave_memory () apply here too. .BR numa_tonodemask_memory () -put memory on a specific set of nodes. The constraints described for +put memory on a specific set of nodes. The constraints described for .BR numa_interleave_memory () -apply here too. +apply here too. .BR numa_setlocal_memory () -locates memory on the current node. The constraints described for +locates memory on the current node. The constraints described for .BR numa_interleave_memory () apply here too. .BR numa_police_memory () -locates memory with the current NUMA policy. 
The constraints described for +locates memory with the current NUMA policy. The constraints described for .BR numa_interleave_memory () apply here too. @@ -560,8 +645,8 @@ kernel version of or newer. .BR numa_set_bind_policy () -specifies whether calls that bind memory to a specific node should -use the preferred policy or a strict policy. +specifies whether calls that bind memory to a specific node should +use the preferred policy or a strict policy. The preferred policy allows the kernel to allocate memory on other nodes when there isn't enough free on the target node. strict will fail the allocation in that case. @@ -571,7 +656,7 @@ the first node in some kernel versions. .BR numa_set_strict () sets a flag that says whether the functions allocating on specific -nodes should use use a strict policy. Strict means the allocation +nodes should use a strict policy. Strict means the allocation will fail if the memory cannot be allocated on the target node. Default operation is to fall back to other nodes. This doesn't apply to interleave and default. @@ -755,18 +840,18 @@ mask can be set by calls to .BR numa_bitmask_setbit(). .BR numa_error () -is a weak internal -.I libnuma -function that can be overridden by the -user program. +is a +.I libnuma +internal function that can be overridden by the +user program. This function is called with a .I char * argument when a .I libnuma function fails. -Overriding the weak library definition +Overriding the library internal definition makes it possible to specify a different error handling strategy -when a +when a .I libnuma function fails. It does not affect .BR numa_available (). @@ -785,40 +870,30 @@ The default value of is zero. .BR numa_warn () -is a weak internal -.I libnuma -function that can be also overridden -by the user program. -It is called to warn the user when a +is a +.I libnuma +internal function that can also be overridden +by the user program. 
+It is called to warn the user when a .I libnuma function encounters a non-fatal error. The default implementation -prints a warning to +prints a warning to .IR stderr . The first argument is a unique -number identifying each warning. After that there is a -.BR printf (3)-style +number identifying each warning. After that there is a +.BR printf (3)-style format string and a variable number of arguments. .SH THREAD SAFETY .I numa_set_bind_policy and .I numa_exit_on_error -are process global. The other calls are thread safe. - -Memory policy set for memory areas is shared by all threads -of the process. Memory policy is also -shared by other processes mapping the same memory using -.I shmat(2) -or -.I mmap(2) -from shmfs/hugetlbfs. It is not shared for -disk backed file mappings right now although that may change in the future. - +are process global. The other calls are thread safe. .SH COPYRIGHT Copyright 2002, 2004, 2007, Andi Kleen, SuSE Labs. -.I libnuma +.I libnuma is under the GNU Lesser General Public License, v2.1. .SH SEE ALSO
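
Illustration for item 7 of the description (not part of the patch): since
the size and start arguments are expected to be page aligned, a caller may
want to do the rounding itself before calling into the library. The sketch
below shows only that arithmetic. The helper names round_up_to_page(),
is_page_aligned() and system_page_size() are hypothetical and are not
libnuma functions, and the 4096-byte fallback is an assumption made for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of libnuma: round size up to the next
 * multiple of page_size. page_size must be a nonzero power of two.
 * Sizes within page_size of SIZE_MAX would wrap around; not handled here.
 */
static inline size_t round_up_to_page(size_t size, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (size + page_size - 1) & ~(page_size - 1);
}

/* Hypothetical helper: nonzero if addr is a multiple of page_size. */
static inline int is_page_aligned(size_t addr, size_t page_size)
{
    return (addr & (page_size - 1)) == 0;
}

/*
 * Query the system page size via sysconf(3). The 4096 fallback is an
 * assumption for this sketch, used only if the query fails.
 */
static inline size_t system_page_size(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    return ps > 0 ? (size_t)ps : 4096;
}
```

A caller could compute round_up_to_page(len, system_page_size()) once and
pass the same rounded value to both the allocation call and the matching
numa_free(), so the two sizes cannot drift apart.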