On Thu, 2009-07-30 at 12:18 +0100, Mel Gorman wrote:
> On Wed, Jul 29, 2009 at 01:54:50PM -0400, Lee Schermerhorn wrote:
> > PATCH 0/4 hugetlb: constrain allocation/free based on task mempolicy
> >
> > I'm sending these out again, slightly revised, for comparison
> > with a 3rd alternative for controlling where persistent huge
> > pages are allocated, which I'll send out as a separate series.
> >
> > Against: 2.6.31-rc3-mmotm-090716-1432
> > atop previously submitted "alloc_bootmem_huge_pages() fix"
> > [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
> >
> > This is V3 of a series of patches to constrain the allocation and
> > freeing of persistent huge pages using the NUMA mempolicy of the
> > task modifying "nr_hugepages". This series is based on Mel
> > Gorman's suggestion to use task mempolicy. One of the benefits
> > of this method is that it does not *require* modification to
> > hugeadm(8) to use this feature.
> >
> > V3 factors the "rework" of the hstate_next_node_to_{alloc|free}
> > functions out of the patch to derive huge pages nodes_allowed
> > from mempolicy, and moves it before the patch to add nodemasks
> > to the alloc/free functions. See patch 1/4.
> >
> > A couple of limitations [still] in this version:
> >
> > 1) I haven't implemented a boot time parameter to constrain the
> >    boot time allocation of huge pages. This can be added if
> >    anyone feels strongly that it is required.
> >
> > 2) I have not implemented a per node nr_overcommit_hugepages as
> >    David Rientjes and I discussed earlier. Again, this can be
> >    added, and specific nodes can be addressed using mempolicy,
> >    as this series does for allocation and free. However, after
> >    some experience with the libhugetlbfs test suite, specifically
> >    attempting to run the test suite constrained by mempolicy and
> >    a cpuset, I'm thinking that per node overcommit limits might
> >    not be such a good idea. They would require an application
> >    [or the library] to sum the per node limits over the allowed
> >    nodes and possibly compare to global limits to determine the
> >    available resources. Per cpuset limits might work better.
> >    This area requires more investigation, but this patch series
> >    doesn't seem to make things worse than they already are in
> >    this regard.
>
> There needs to be a third limitation listed here, and preferably added
> as a note in the documentation or, better yet, warned about explicitly
> at runtime:
>
> 3) hugetlb reservations are not mempolicy aware. If an application
>    runs with access to only a subset of the nodes with hugepages, it
>    may encounter stability problems, as mmap() will return success
>    and then potentially fail a page fault later.
>
> I'm ok with that for the moment, but it'll be something that
> eventually needs to be addressed. However, I don't consider it a
> prerequisite for this patchset, because there is obvious utility for
> administrators who want to run a limited number of hugepage
> applications all on the same node; that case is covered by this
> patch.

Mel: That's not exactly a limitation of this series. The fact that
hugetlb reservations are not mempolicy aware will bite you independent
of this patch series. I think we want reservations to be independent
of mempolicy, because unless the application is restricted to a single
node, we don't know from which node it will attempt to allocate a huge
page at fault time.
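Just to make that failure mode concrete: an (untested) sketch like the
one below should show it. It assumes a hugetlbfs mount at /mnt/huge,
2M huge pages, at least one free huge page somewhere but none on node
0, and libnuma's numaif.h for set_mempolicy() [build with -lnuma].

/* rsv-sketch.c - mmap() of a huge page succeeds; the fault may not */
#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assuming 2M huge pages */

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* bind to node 0 only */
	char *p;
	int fd;

	fd = open("/mnt/huge/rsvtest", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	unlink("/mnt/huge/rsvtest");

	/* Restrict this task's page allocations to node 0. */
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}

	/*
	 * The reservation made here is charged against the global
	 * free huge page count, ignoring the mempolicy, so mmap()
	 * succeeds as long as *any* node has a free huge page.
	 */
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * The fault-time allocation does honor the mempolicy, so if
	 * node 0 has no free huge pages, this touch can SIGBUS even
	 * though the mmap() above succeeded.
	 */
	p[0] = 1;
	printf("touched the huge page without a SIGBUS\n");
	return 0;
}

Nothing in it depends on this series -- the reservation is charged
against the global pool either way.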
As I recall, one of the purposes of reservations was to guarantee
successful allocation at fault time when we removed the allocation at
mmap/attach time. Since we don't know from which of the allowed nodes
a task will attempt to allocate at fault time, it's difficult to
charge a reserve against any given node. This will continue to be a
problem unless we have sufficient capacity in the allowed nodes to
satisfy the reserve count. The same problem exists for overcommit
huge pages.

> Other than the possible memory leak in patch 3 which I've commented
> on there, I'm fine with the patchset.

Yeah, I apparently dropped that while reworking the series. I'll post
an incremental fix.

I also need to ping David R about the "3rd alternative" -- per node
attributes -- as he was favoring that method. I'm hoping we can reach
consensus on the approach and then focus on refining that one.

Later,
Lee
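P.S. For anyone who wants to see the shape of the interface without
digging through the patches: with the series applied, growing the pool
under a mempolicy is intended to look something like the sketch below.
This is my illustration, not code from the patches; the node number
and page count are arbitrary, and it needs root plus libnuma's
numaif.h [build with -lnuma].

#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* allow node 0 only */
	const char *count = "32\n";		/* new persistent pool size */
	int fd;

	/*
	 * Install the mempolicy that the patched kernel consults when
	 * it grows (or shrinks) the persistent huge page pool below.
	 */
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}

	/*
	 * With the series applied, the pages added here should be
	 * allocated only on node 0; on a stock kernel they are
	 * interleaved across all online nodes.
	 */
	fd = open("/proc/sys/vm/nr_hugepages", O_WRONLY);
	if (fd < 0) {
		perror("open nr_hugepages");
		return 1;
	}
	if (write(fd, count, strlen(count)) != (ssize_t)strlen(count))
		perror("write nr_hugepages");
	close(fd);
	return 0;
}

The same binding applies when shrinking the pool, so free huge pages
can be removed from specific nodes as well.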