On 30 Aug 2018, at 3:00, Michal Hocko wrote:

> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> [...]
>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>
>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
>> no swap, THPs are allocated in the fallback node.
>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
>> disk instead of being allocated in the fallback node.
>> 3. no madvise, THP is on by default + defrag = {always, defer,
>> defer+madvise}: pages got swapped to the disk instead of being
>> allocated in the fallback node.
>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
>> pages are allocated in the fallback node.
>>
>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
>>
>> The reason, as Andrea mentioned in his email, is that the combination
>> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
>> from this experiment).
>
> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> see kswapd do the reclaim to balance the node. If the node is full of
> anonymous pages then there is no other way than swap out.

GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given and
THP is on with defrag=always, gfp_mask has both __GFP_THISNODE and
__GFP_DIRECT_RECLAIM, so swapping can be triggered.

The key issue here is that “memhog -r3 130g” uses the default memory
policy (MPOL_DEFAULT), which should allow page allocation to fall back
to other nodes. But as result 3 shows, swapping is triggered instead of
falling back.
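
To make the flag argument concrete, here is a minimal sketch in Python
(a toy model, not kernel code). The bit values, the OTHER_BITS
placeholder and the helper name reclaims_on_pinned_node are made up for
illustration. Only the shape of the two GFP_TRANSHUGE definitions is
meant to follow include/linux/gfp.h.

```python
# Toy bit values. The real ones live in include/linux/gfp.h and vary
# between kernel versions; these are assumptions for illustration only.
DIRECT_RECLAIM = 0x1   # models __GFP_DIRECT_RECLAIM
KSWAPD_RECLAIM = 0x2   # models __GFP_KSWAPD_RECLAIM
THISNODE = 0x4         # models __GFP_THISNODE
OTHER_BITS = 0x8       # hypothetical stand-in for the remaining GFP bits

RECLAIM = DIRECT_RECLAIM | KSWAPD_RECLAIM  # models __GFP_RECLAIM

# Same shape as the kernel definitions: the LIGHT variant masks out all
# reclaim bits, and GFP_TRANSHUGE adds direct reclaim back.
GFP_TRANSHUGE_LIGHT = (OTHER_BITS | RECLAIM) & ~RECLAIM
GFP_TRANSHUGE = GFP_TRANSHUGE_LIGHT | DIRECT_RECLAIM


def reclaims_on_pinned_node(mask):
    """True if the mask pins the allocation to one node (no zonelist
    fallback) and also allows direct reclaim on that node."""
    return bool(mask & THISNODE) and bool(mask & DIRECT_RECLAIM)


if __name__ == "__main__":
    # No madvise, THP on, defrag=always: GFP_TRANSHUGE plus THISNODE.
    mask = GFP_TRANSHUGE | THISNODE
    print("direct reclaim on a pinned node:", reclaims_on_pinned_node(mask))
```

In this model, any mask built from GFP_TRANSHUGE carries the direct
reclaim bit, so adding THISNODE yields exactly the combination described
above: the allocation cannot fall back and reclaims on the full node.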
>
>> __THIS_NODE uses ZONELIST_NOFALLBACK, which
>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
>> reclaim in the first page allocation node when fallback nodes are
>> removed by ZONELIST_NOFALLBACK.
>
> Yes but the point is that the allocations which use __GFP_THISNODE are
> optimistic so they shouldn't fallback to remote NUMA nodes.

This can be achieved by using the MPOL_BIND memory policy, which
restricts the nodemask in struct alloc_context for user space memory
allocations.

>
>> IMHO, __THIS_NODE should not be used for user memory allocation at
>> all, since it fights against most of memory policies. But kernel
>> memory allocation would need it as a kernel MPOL_BIND memory policy.
>
> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> it here. But the problem is that optimistic THP allocations should
> prefer a local node because a remote node might easily offset the
> advantage of the THP. I do not have a great idea how to achieve that
> without __GFP_THISNODE though.

The MPOL_PREFERRED memory policy can be used to achieve this optimistic
THP allocation for user space. Even with the default memory policy, the
local memory node is used first until it is full. It seems to me that
__GFP_THISNODE is not necessary if a proper memory policy is used.

Let me know if I missed anything. Thanks.

--
Best Regards,
Yan Zi