Re: [PATCH 4/4] hugetlb: add per node hstate attributes

On Wed, Jul 29, 2009 at 11:12 AM, Lee Schermerhorn <lee.schermerhorn@xxxxxx> wrote:
> PATCH/RFC 4/4 hugetlb:  register per node hugepages attributes
>
> Against: 2.6.31-rc3-mmotm-090716-1432
> atop the previously posted alloc_bootmem_hugepages fix.
> [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
>
> This patch adds the per huge page size control/query attributes
> to the per node sysdevs:
>
> /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
>        nr_hugepages       - r/w
>        free_huge_pages    - r/o
>        surplus_huge_pages - r/o
>
> The patch attempts to re-use/share as much of the existing
> global hstate attribute initialization and handling as possible.
> Throughout, a node id < 0 indicates global hstate parameters.
>
> Note:  computation of "min_count" in set_max_huge_pages() for a
> specified node needs careful review.
>
> Issue:  dependency of the base [node] driver on the hugetlbfs module.
> We want to keep all of the hstate attribute registration and handling
> in the hugetlb module.  However, we need to call into this code to
> register the per node hstate attributes on node hot plug.
>
> With this patch:
>
> (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
>
> Starting from:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     0
> Node 2 HugePages_Free:      0
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 0
>
> Allocate 16 persistent huge pages on node 2:
> (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>
> Yields:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:    16
> Node 2 HugePages_Free:     16
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 16
>
> Global controls work as expected--reduce pool to 8 persistent huge pages:
> (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     8
> Node 2 HugePages_Free:      8
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
>

Thank you very much for doing this.

Google is going to need this support regardless of what finally gets
merged into mainline, so I'm thrilled you've implemented this version.

I hugely (get it? hugely :) favor this approach because reserving
hugepages through this interface is much simpler than doing it through
a mempolicy-based approach once hugepages have already been allocated.
For cpusets users in particular, jobs typically get allocated on the
subset of nodes required by that application, and they don't last for
the duration of the machine's uptime.  When a job exits and the nodes
need to be reallocated to a new cpuset, that may be a very different
set of mems, depending on the memory requirements or interleave
optimizations of the new job.  Allocating resources such as hugepages
is still possible in this scenario via mempolicies, but it would
require a temporary mempolicy just to allocate the additional
hugepages, which seems like an unnecessary requirement, especially if
the job scheduler governing hugepage allocations already has a
mempolicy of its own.
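
To make that concrete, here is a rough sketch of what a job scheduler
could do with the per-node attributes when it populates a new cpuset.
The node list, page count, and script are made up for illustration;
only the sysfs path comes from the patch:

    #!/bin/sh
    # Preallocate 2MB huge pages on exactly the mems a new cpuset job
    # will use; the scheduler itself needs no mempolicy for this.
    JOB_MEMS="2 3"          # the new cpuset's mems
    PAGES_PER_NODE=16       # persistent huge pages per node

    for node in $JOB_MEMS; do
        echo $PAGES_PER_NODE > \
            /sys/devices/system/node/node$node/hugepages/hugepages-2048kB/nr_hugepages
    done

    # On job exit, the same loop with a count of 0 returns the pages
    # to the buddy allocator.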

So it's my opinion that the mempolicy-based approach is very
appropriate for tasks that allocate hugepages themselves.  Other
users, particularly cpusets users, would instead require hugepages to
be preallocated before a job is scheduled, in which case the job
scheduler would need a temporary mempolicy.  That seems like an
inconvenience when the entire state of the system's hugepages could
easily be governed with the per-node hstate attributes and a slightly
modified user library.
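
For comparison, the temporary-mempolicy workaround I have in mind
would look something like the line below.  This assumes the
mempolicy-based proposal makes a write to the global nr_hugepages
honor the writing task's policy; that behavior is not part of this
patch, so treat it purely as an illustration:

    # Hypothetical: grow the pool on node 2 only, by wrapping the
    # global knob in a throwaway mempolicy.
    numactl --membind=2 sh -c \
        'echo 16 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages'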
