Re: [PATCH 2/2] mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask

"Zi Yan" <zi.yan@xxxxxxxxxxxxxx> · Thu, 04 Oct 2018 17:49:47 -0400

On 4 Oct 2018, at 16:17, David Rientjes wrote:

> On Wed, 26 Sep 2018, Kirill A. Shutemov wrote:
>
>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..c0bcede31930 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   *	    available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +	gfp_t this_node = 0;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	struct mempolicy *pol;
>>> +	/*
>>> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
>>> +	 * specified, to express a general desire to stay on the current
>>> +	 * node for optimistic allocation attempts. If the defrag mode
>>> +	 * and/or madvise hint requires the direct reclaim then we prefer
>>> +	 * to fallback to other node rather than node reclaim because that
>>> +	 * can lead to excessive reclaim even though there is free memory
>>> +	 * on other nodes. We expect that NUMA preferences are specified
>>> +	 * by memory policies.
>>> +	 */
>>> +	pol = get_vma_policy(vma, addr);
>>> +	if (pol->mode != MPOL_BIND)
>>> +		this_node = __GFP_THISNODE;
>>> +	mpol_cond_put(pol);
>>> +#endif
>>
>> I'm not very good with NUMA policies. Could you explain in more details how
>> the code above is equivalent to the code below?
>>
>
> It breaks mbind() because new_page() is now using numa_node_id() to
> allocate migration targets for instead of using the mempolicy.  I'm not
> sure that this patch was tested for mbind().

I do not see mbind() is broken. With both patches applied, I ran
"numactl -N 0 memhog -r1 4096m membind 1" and saw all pages are allocated
in Node 1 not Node 0, which is returned by numa_node_id().

From the source code, in alloc_pages_vma(), the nodemask is generated
from the memory policy (i.e. mbind in the case above), which only has
the nodes specified by mbind(). Then, __alloc_pages_nodemask() only uses
the zones from the nodemask. The numa_node_id() return value will be
ignored in the actual page allocation process if mbind policy is applied.

Let me know if I miss anything.

--
Best Regards
Yan Zi
Attachment:
signature.asc

Description: OpenPGP digital signature