Re: [RFC] a question about reuse hwpoison page in soft_offline_page()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Naoya, Xishi,

We have a similar problem.
@zhangfei, could you please describe your problem here.

On 2018/7/6 16:18, Naoya Horiguchi wrote:
> On Fri, Jul 06, 2018 at 11:37:41AM +0800, 裘稀石(稀石) wrote:
>> This patch add05cec
>> (mm: soft-offline: don't free target page in successful page migration) removes
>> set_migratetype_isolate() and unset_migratetype_isolate() in soft_offline_page
>> ().
>>
>> And this patch 243abd5b
>> (mm: hugetlb: prevent reuse of hwpoisoned free hugepages) changes
>> if (!is_migrate_isolate_page(page)) to if (!PageHWPoison(page)), so it could
>> prevent someone
>> reuse the free hugetlb again after set the hwpoison flag
>> in soft_offline_free_page()
>>
>> My question is that if someone reuse the free hugetlb again before 
>> soft_offline_free_page() and
>> after get_any_page(), then it uses the hopoison page, and this may trigger mce
>> kill later, right?
> 
> Hi Xishi,
> 
> Thank you for pointing out the issue. That's nice catch.
> 
> I think that the race condition itself could happen, but it doesn't lead
> to MCE kill because PageHWPoison is not visible to HW which triggers MCE.
> PageHWPoison flag is just a flag in struct page to report the memory error
> from kernel to userspace. So even if a CPU is accessing to the page whose
> struct page has PageHWPoison set, that doesn't cause a MCE unless the page
> is physically broken.
> The type of memory error that soft offline tries to handle is corrected
> one which is not a failure yet although it's starting to wear.
> So such PageHWPoison page can be reused, but that's not critical because
> the page is freed at some point afterword and error containment completes.
> 
> However, I noticed that there's a small pain in free hugetlb case.
> We call dissolve_free_huge_page() in soft_offline_free_page() which moves
> the PageHWPoison flag from the head page to the raw error page.
> If the reported race happens, dissolve_free_huge_page() just return without
> doing any dissolve work because "if (PageHuge(page) && !page_count(page))"
> block is skipped.
> The hugepage is allocated and used as usual, but the contaiment doesn't
> complete as expected in the normal page, because free_huge_pages() doesn't
> call dissolve_free_huge_page() for hwpoison hugepage. This is not critical
> because such error hugepage just reside in free hugepage list. But this
> might looks like a kind of memory leak. And even worse when hugepage pool
> is shrinked and the hwpoison hugepage is freed, the PageHWPoison flag is
> still on the head page which is unlikely to be an actual error page.
> 
> So I think we need improvement here, how about the fix like below?
> 
>   (not tested yet, sorry)
> 
>   diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>   --- a/mm/memory-failure.c
>   +++ b/mm/memory-failure.c
>   @@ -1883,6 +1883,11 @@ static void soft_offline_free_page(struct page *page)
>           struct page *head = compound_head(page);
>   
>           if (!TestSetPageHWPoison(head)) {
>   +               if (page_count(head)) {
>   +                       ClearPageHWPoison(head);
>   +                       return;
>   +               }
>   +
>                   num_poisoned_pages_inc();
>                   if (PageHuge(head))
>                           dissolve_free_huge_page(page);
> 
> Thanks,
> Naoya Horiguchi
> 
> .
> 

-- 
Thanks,
Xie XiuQi




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux