Re: [RFC PATCH 1/3] mm, numa: rework do_pages_move

Anshuman Khandual <khandual@xxxxxxxxxxxxxxxxxx> · Wed, 3 Jan 2018 15:06:49 +0530

On 01/03/2018 02:28 PM, Michal Hocko wrote:
> On Wed 03-01-18 14:12:17, Anshuman Khandual wrote:
>> On 12/08/2017 09:45 PM, Michal Hocko wrote:
> [...]

[...]

>>
>> This reuses the existing page allocation helper from migrate_pages() system
>> call. But all these allocator helper names for migrate_pages() function are
>> really confusing. Even in this case alloc_new_node_page and the original
>> new_node_page() which is still getting used in do_migrate_range() sounds
>> similar even if their implementation is quite different. IMHO either all of
>> them should be moved to the header file with proper differentiating names
>> or let them be there in their respective files with these generic names and
>> clean them up later.
> 
> I believe I've tried that but I couldn't make them into a single header
> file easily because of header file dependencies. I agree that their
> names are quite confusing. Feel free to send a patch to clean this up.

Sure. Will try once this one gets into mmotm.

[...]

>>
>>
>> Just a nit. new_page_node() and store_status() seems different. Then why
>> the git diff looks so clumsy.
> 
> Kirill was suggesting to use --patience to general the diff which leads
> to a slightly better output. It has been posted as a separate email [1].
> Maybe you will find that one easier to review.
> 
> [1] http://lkml.kernel.org/r/20171213143948.GM25185@xxxxxxxxxxxxxx

Yeah it does look better.

[...]

>>> -		return thp;
>>> -	} else
>>> -		return __alloc_pages_node(pm->node,
>>> -				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>>> +	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
>>> +			MIGRATE_SYNC, MR_SYSCALL);
>>> +	if (err)
>>> +		putback_movable_pages(pagelist);
>>> +	return err;
>>>  }
>>
>> Even this one. IIUC, do_move_pages_to_node() migrate a chunk of pages
>> at a time which belong to the same target node. Perhaps the name should
>> suggest so. All these helper page migration helper functions sound so
>> similar.
> 
> What do you suggest? I find do_move_pages_to_node quite explicit on its
> purpose.

Sure. Not a big deal.

[...]

>>>  {
>>> +	struct vm_area_struct *vma;
>>> +	struct page *page;
>>> +	unsigned int follflags;
>>>  	int err;
>>> -	struct page_to_node *pp;
>>> -	LIST_HEAD(pagelist);
>>>  
>>>  	down_read(&mm->mmap_sem);
>>
>> Holding mmap_sem for individual pages makes sense. Current
>> implementation is holding it for an entire batch.
> 
> I didn't bother to optimize this path to be honest. It is true that lock
> batching can lead to improvements but that would complicate the code
> (how many patches to batch?) so I've left that for later if somebody
> actually sees any problem.
> 
>>> +	err = -EFAULT;
>>> +	vma = find_vma(mm, addr);
>>> +	if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>>
>> While here, should not we add 'addr > vma->vm_end' into this condition ?
> 
> No. See what find_vma does.
> 

Right.

> [...]
> 
> Please cut out the quoted reply to minimum

Sure will do.

> 
>>> @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>>>  			 const int __user *nodes,
>>>  			 int __user *status, int flags)
>>>  {
>>> -	struct page_to_node *pm;
>>> -	unsigned long chunk_nr_pages;
>>> -	unsigned long chunk_start;
>>> -	int err;
>>> -
>>> -	err = -ENOMEM;
>>> -	pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
>>> -	if (!pm)
>>> -		goto out;
>>> +	int chunk_node = NUMA_NO_NODE;
>>> +	LIST_HEAD(pagelist);
>>> +	int chunk_start, i;
>>> +	int err = 0, err1;
>>
>> err init might not be required, its getting assigned to -EFAULT right away.
> 
> No, nr_pages might be 0 AFAICS.

Right but there is another err = 0 after the for loop.

> 
> [...]
>>> +		if (chunk_node == NUMA_NO_NODE) {
>>> +			chunk_node = node;
>>> +			chunk_start = i;
>>> +		} else if (node != chunk_node) {
>>> +			err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> +			if (err)
>>> +				goto out;
>>> +			err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> +			if (err)
>>> +				goto out;
>>> +			chunk_start = i;
>>> +			chunk_node = node;
>>>  		}

[...]

>>> +		err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> +		if (err)
>>> +			goto out;
>>> +		if (i > chunk_start) {
>>> +			err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> +			if (err)
>>> +				goto out;
>>> +		}
>>> +		chunk_node = NUMA_NO_NODE;
>>
>> This block of code is bit confusing.
> 
> I believe this is easier to grasp when looking at the resulting code.
>>
>> 1) Why attempt to migrate when just one page could not be isolated ?
>> 2) 'i' is always greater than chunk_start except the starting page
>> 3) Why reset chunk_node as NUMA_NO_NODE ?
> 
> This is all about flushing the pending state on an error and
> distinguising a fresh batch.

Okay. Will test it out on a multi node system once I get hold of one.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>