Re: [PATCH v3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()

Sasha Levin <sasha.levin@xxxxxxxxxx> · Sat, 01 Mar 2014 18:08:17 -0500

On 02/28/2014 10:35 PM, Naoya Horiguchi wrote:
> On Fri, Feb 28, 2014 at 03:14:27PM -0800, Andrew Morton wrote:
>> On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> wrote:
>>
>>> page->index stores pagecache index when the page is mapped into file mapping
>>> region, and the index is in pagecache size unit, so it depends on the page
>>> size. Some of users of reverse mapping obviously assumes that page->index
>>> is in PAGE_CACHE_SHIFT unit, so they don't work for anonymous hugepage.
>>>
>>> For example, consider that we have 3-hugepage vma and try to mbind the 2nd
>>> hugepage to migrate to another node. Then the vma is split and migrate_page()
>>> is called for the 2nd hugepage (belonging to the middle vma.)
>>> In migrate operation, rmap_walk_anon() tries to find the relevant vma to
>>> which the target hugepage belongs, but here we miscalculate pgoff.
>>> So anon_vma_interval_tree_foreach() grabs invalid vma, which fires VM_BUG_ON.
>>>
>>> This patch introduces a new API that is usable both for normal page and
>>> hugepage to get PAGE_SIZE offset from page->index. Users should clearly
>>> distinguish page_index for pagecache index and page_pgoff for page offset.
>>>
>>> ..
>>>
>>> --- a/include/linux/pagemap.h
>>> +++ b/include/linux/pagemap.h
>>> @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
>>>   	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>>>   }
>>>   
>>> +static inline unsigned int page_size_order(struct page *page)
>>> +{
>>> +	return unlikely(PageHuge(page)) ?
>>> +		huge_page_size_order(page) :
> 
> I found that we have compound_order(page) for the same purpose, so we don't
> have to define this new function.
> 
>>> +		(PAGE_CACHE_SHIFT - PAGE_SHIFT);
>>> +}
>>
>> Could use some nice documentation, please.  Why it exists, what it
>> does.  Particularly: what sort of pages it can and can't operate on,
>> and why.
> 
> OK.
> 
>> The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
>> that the page is a pagecache page.  I dunno, maybe just use "0"?
> 
> Yes, PAGE_CACHE_SHIFT makes code messy if PAGE_CACHE_SHIFT is always PAGE_SHIFT.
> But I guess that recently people start to thinking of changing the size of
> pagecache (in the discussion around >4kB sector device.)
> And from readabilitie's perspective, "pagecache size" and "page size" are
> different things, so keeping it is better in a long run.
> 
> Anyway, I revised the patch again, could you take a look?
> 
> Thanks,
> Naoya
> ---
> From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Date: Fri, 28 Feb 2014 21:56:24 -0500
> Subject: [PATCH] mm, hugetlbfs: fix rmapping for anonymous hugepages with
>   page_pgoff()
> 
> page->index stores pagecache index when the page is mapped into file mapping
> region, and the index is in pagecache size unit, so it depends on the page
> size. Some of users of reverse mapping obviously assumes that page->index
> is in PAGE_CACHE_SHIFT unit, so they don't work for anonymous hugepage.
> 
> For example, consider that we have 3-hugepage vma and try to mbind the 2nd
> hugepage to migrate to another node. Then the vma is split and migrate_page()
> is called for the 2nd hugepage (belonging to the middle vma.)
> In migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate pgoff.
> So anon_vma_interval_tree_foreach() grabs invalid vma, which fires VM_BUG_ON.
> 
> This patch introduces a new API that is usable both for normal page and
> hugepage to get PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index for pagecache index and page_pgoff for page offset.
> 
> ChangeLog v3:
> - add comment on page_size_order()
> - use compound_order(compound_head(page)) instead of huge_page_size_order()
> - use page_pgoff() in rmap_walk_file() too
> - use page_size_order() in kill_proc()
> - fix space indent
> 
> ChangeLog v2:
> - fix wrong shift direction
> - introduce page_size_order() and huge_page_size_order()
> - move the declaration of PageHuge() to include/linux/hugetlb_inline.h
>    to avoid macro definition.
> 
> Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx> # if the reported problem is fixed
> Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx # 3.12+

I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.

Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>