On Thu 30-08-18 09:22:21, Zi Yan wrote:
> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
>
> > On Wed 29-08-18 18:54:23, Zi Yan wrote:
> > [...]
> >> I tested it against Linus’s tree with “memhog -r3 130g” in a
> >> two-socket machine with 128GB memory on each node and got the
> >> results below. I expect this test to fill one node, then fall back
> >> to the other.
> >>
> >> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> >>    no swap, THPs are allocated in the fallback node.
> >> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> >>    disk instead of being allocated in the fallback node.
> >> 3. no madvise, THP is on by default + defrag = {always, defer,
> >>    defer+madvise}: pages got swapped to the disk instead of being
> >>    allocated in the fallback node.
> >> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> >>    pages are allocated in the fallback node.
> >>
> >> Results 2 and 3 seem unexpected, since pages should be allocated in
> >> the fallback node.
> >>
> >> The reason, as Andrea mentioned in his email, is the combination of
> >> __GFP_THISNODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> >> in this experiment).
> >
> > But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> > We do for __GFP_KSWAPD_RECLAIM though, and I guess it is expected to
> > see kswapd do the reclaim to balance the node. If the node is full of
> > anonymous pages then there is no other way than swap out.
>
> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given,
> THP is on + defrag=always, gfp_mask has __GFP_THISNODE and
> __GFP_DIRECT_RECLAIM, so swapping can be triggered.

Yes, but that setup says you are willing to pay the price to get a THP.
defrag=always uses that special __GFP_NORETRY (unless it is a madvised
mapping) which should back off if compaction failed recently. How much
that reduces the reclaim is not really clear to me right now, to be
honest.
> The key issue here is that “memhog -r3 130g” uses the default memory
> policy (MPOL_DEFAULT), which should allow page allocation fallback to
> other nodes, but as shown in result 3, swapping is triggered instead
> of page allocation fallback.

Well, I guess this really depends. Fallback to a different node might be
seen as a bad thing, and worse than the reclaim on the local node.

> >> __GFP_THISNODE uses ZONELIST_NOFALLBACK, which removes the fallback
> >> possibility, and __GFP_*_RECLAIM triggers page reclaim in the first
> >> page allocation node when fallback nodes are removed by
> >> ZONELIST_NOFALLBACK.
> >
> > Yes, but the point is that the allocations which use __GFP_THISNODE
> > are optimistic, so they shouldn't fall back to remote NUMA nodes.
>
> This can be achieved by using the MPOL_BIND memory policy, which
> restricts the nodemask in struct alloc_context for user space memory
> allocations.

Yes, but that requires explicit NUMA handling. And we are trying to
handle those cases which do not really give a damn and just want to use
THP if it is available, or try harder when they ask for it by using
madvise.

> >> IMHO, __GFP_THISNODE should not be used for user memory allocation
> >> at all, since it fights against most memory policies. But kernel
> >> memory allocation would need it as a kernel MPOL_BIND memory policy.
> >
> > __GFP_THISNODE is indeed an ugliness. I would really love to get rid
> > of it here. But the problem is that optimistic THP allocations should
> > prefer a local node because a remote node might easily offset the
> > advantage of the THP. I do not have a great idea how to achieve that
> > without __GFP_THISNODE though.
>
> The MPOL_PREFERRED memory policy can be used to achieve this optimistic
> THP allocation for user space. Even with the default memory policy, the
> local memory node will be used first until it is full. It seems to me
> that __GFP_THISNODE is not necessary if a proper memory policy is used.
>
> Let me know if I miss anything. Thanks.
You are missing that we are trying to define a sensible model for those
who do not really care about mempolicies. THP shouldn't cause more harm
than good for those. I wish we could come up with a remotely sane and
comprehensible model. That means that you know how hard the allocator
tries to get a THP for you depending on the defrag configuration, your
memory policy and your madvise setting. The easiest one I can think of
is to
- always follow the mempolicy when specified, because you asked for it
  explicitly
- stay node local and low latency for the light THP defrag modes (defer,
  madvise without the hint, and none), because THP is a nice-to-have
- if the defrag mode is always, then you are willing to pay the latency
  price, but off-node might still be a no-no
- allow fallback for madvised mappings, because you really want THP. If
  you care about specific NUMA placement then combine it with a
  mempolicy.

As you can see, I do not really mention anything about direct reclaim,
because that is just an implementation detail of the page allocator and
compaction interaction. Maybe you can formulate a saner matrix with all
the available modes that we have.

Anyway, I guess we can agree that the (almost) unconditional
__GFP_THISNODE is clearly wrong and we should address that first: either
Andrea's option 2) patch, or mine, which does a similar thing except at
the proper layer (I believe). We can continue discussing other odd cases
on top, I guess. Unless somebody has a much brighter idea, of course.
--
Michal Hocko
SUSE Labs