On Wed, Jul 29, 2009 at 11:12 AM, Lee Schermerhorn<lee.schermerhorn@xxxxxx> wrote:
> PATCH/RFC 4/4 hugetlb: register per node hugepages attributes
>
> Against: 2.6.31-rc3-mmotm-090716-1432
> atop the previously posted alloc_bootmem_hugepages fix.
> [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
>
> This patch adds the per huge page size control/query attributes
> to the per node sysdevs:
>
> /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
>     nr_hugepages       - r/w
>     free_huge_pages    - r/o
>     surplus_huge_pages - r/o
>
> The patch attempts to re-use/share as much of the existing
> global hstate attribute initialization and handling as possible.
> Throughout, a node id < 0 indicates global hstate parameters.
>
> Note: computation of "min_count" in set_max_huge_pages() for a
> specified node needs careful review.
>
> Issue: dependency of base driver [node] dependency on hugetlbfs module.
> We want to keep all of the hstate attribute registration and handling
> in the hugetlb module. However, we need to call into this code to
> register the per node hstate attributes on node hot plug.
>
> With this patch:
>
> (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
>
> Starting from:
> Node 0 HugePages_Total: 0
> Node 0 HugePages_Free:  0
> Node 0 HugePages_Surp:  0
> Node 1 HugePages_Total: 0
> Node 1 HugePages_Free:  0
> Node 1 HugePages_Surp:  0
> Node 2 HugePages_Total: 0
> Node 2 HugePages_Free:  0
> Node 2 HugePages_Surp:  0
> Node 3 HugePages_Total: 0
> Node 3 HugePages_Free:  0
> Node 3 HugePages_Surp:  0
> vm.nr_hugepages = 0
>
> Allocate 16 persistent huge pages on node 2:
> (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>
> Yields:
> Node 0 HugePages_Total: 0
> Node 0 HugePages_Free:  0
> Node 0 HugePages_Surp:  0
> Node 1 HugePages_Total: 0
> Node 1 HugePages_Free:  0
> Node 1 HugePages_Surp:  0
> Node 2 HugePages_Total: 16
> Node 2 HugePages_Free:  16
> Node 2 HugePages_Surp:  0
> Node 3 HugePages_Total: 0
> Node 3 HugePages_Free:  0
> Node 3 HugePages_Surp:  0
> vm.nr_hugepages = 16
>
> Global controls work as expected--reduce pool to 8 persistent huge pages:
> (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>
> Node 0 HugePages_Total: 0
> Node 0 HugePages_Free:  0
> Node 0 HugePages_Surp:  0
> Node 1 HugePages_Total: 0
> Node 1 HugePages_Free:  0
> Node 1 HugePages_Surp:  0
> Node 2 HugePages_Total: 8
> Node 2 HugePages_Free:  8
> Node 2 HugePages_Surp:  0
> Node 3 HugePages_Total: 0
> Node 3 HugePages_Free:  0
> Node 3 HugePages_Surp:  0
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
>

Thank you very much for doing this. Google is going to need this support
regardless of what finally gets merged into mainline, so I'm thrilled
you've implemented this version.

I hugely (get it? hugely :) favor this approach because it's much simpler
to reserve hugepages through this interface than through a mempolicy-based
approach once hugepages have already been allocated. For cpusets users in
particular, jobs typically get allocated a subset of nodes required for
that application, and they don't last for the duration of the machine's
uptime. When a job exits and its nodes need to be reallocated to a new
cpuset, that may be a very different set of mems based on the memory
requirements or interleave optimizations of the new job.
Allocating resources such as hugepages is possible in this scenario via
mempolicies, but it would require a temporary mempolicy to allocate the
additional hugepages from, which seems like an unnecessary requirement,
especially if the job scheduler governing hugepage allocations already has
a mempolicy of its own. So it's my opinion that the mempolicy-based
approach is very appropriate for tasks that allocate hugepages themselves.
Other users, particularly cpusets users, would instead require
preallocation of hugepages prior to a job being scheduled, in which case a
temporary mempolicy would be required for the job scheduler. That seems
like an inconvenience when the entire state of the system's hugepages
could easily be governed with the per-node hstate attributes and a
slightly modified user library.
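
As a rough sketch of what that could look like with your interface (the
node ids, page count, and huge page size below are made up for the
example), the scheduler would simply write the per-node attributes for the
mems of the cpuset it is setting up, with no temporary mempolicy involved:

    # hypothetical job setup: reserve 64 2MB huge pages on each of the
    # new cpuset's mems (nodes 2 and 3 here) before the job is started
    for nid in 2 3; do
        echo 64 > /sys/devices/system/node/node$nid/hugepages/hugepages-2048kB/nr_hugepages
    done

    # hypothetical teardown: shrink the per-node pools again when the
    # job exits and the nodes are handed to the next cpuset
    for nid in 2 3; do
        echo 0 > /sys/devices/system/node/node$nid/hugepages/hugepages-2048kB/nr_hugepages
    done

Doing the same through the global nr_hugepages with a mempolicy-based
approach would mean installing a temporary nodemask mempolicy around each
write, which is exactly the inconvenience described above.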