On Fri, Jul 13, 2018 at 01:35:45PM -0700, Andrew Morton wrote: > On Fri, 13 Jul 2018 12:26:05 +0900 Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> wrote: > > > There's a race condition between soft offline and hugetlb_fault which > > causes unexpected process killing and/or hugetlb allocation failure. > > > > The process killing is caused by the following flow: > > > > CPU 0 CPU 1 CPU 2 > > > > soft offline > > get_any_page > > // find the hugetlb is free > > mmap a hugetlb file > > page fault > > ... > > hugetlb_fault > > hugetlb_no_page > > alloc_huge_page > > // succeed > > soft_offline_free_page > > // set hwpoison flag > > mmap the hugetlb file > > page fault > > ... > > hugetlb_fault > > hugetlb_no_page > > find_lock_page > > return VM_FAULT_HWPOISON > > mm_fault_error > > do_sigbus > > // kill the process > > > > > > The hugetlb allocation failure comes from the following flow: > > > > CPU 0 CPU 1 > > > > mmap a hugetlb file > > // reserve all free page but don't fault-in > > soft offline > > get_any_page > > // find the hugetlb is free > > soft_offline_free_page > > // set hwpoison flag > > dissolve_free_huge_page > > // fail because all free hugepages are reserved > > page fault > > ... > > hugetlb_fault > > hugetlb_no_page > > alloc_huge_page > > ... > > dequeue_huge_page_node_exact > > // ignore hwpoisoned hugepage > > // and finally fail due to no-mem > > > > The root cause of this is that current soft-offline code is written > > based on an assumption that PageHWPoison flag should beset at first to > > avoid accessing the corrupted data. This makes sense for memory_failure() > > or hard offline, but does not for soft offline because soft offline is > > about corrected (not uncorrected) error and is safe from data lost. > > This patch changes soft offline semantics where it sets PageHWPoison flag > > only after containment of the error page completes successfully. > > > > ... > > > > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c > > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c > > @@ -1598,8 +1598,18 @@ static int soft_offline_huge_page(struct page *page, int flags) > > if (ret > 0) > > ret = -EIO; > > } else { > > - if (PageHuge(page)) > > - dissolve_free_huge_page(page); > > + /* > > + * We set PG_hwpoison only when the migration source hugepage > > + * was successfully dissolved, because otherwise hwpoisoned > > + * hugepage remains on free hugepage list, then userspace will > > + * find it as SIGBUS by allocation failure. That's not expected > > + * in soft-offlining. > > + */ > > This comment is unclear. What happens if there's a hwpoisoned page on > the freelist? The allocator just skips it and looks for another page? Yes, this is what the allocator does. > Or does the allocator return the poisoned page, it gets mapped and > userspace gets a SIGBUS when accessing it? If the latter (or the > former!), why does the comment mention allocation failure? The mention to allocation failure was unclear, I'd like to replace with below, is it clearer? + /* + * We set PG_hwpoison only when the migration source hugepage + * was successfully dissolved, because otherwise hwpoisoned + * hugepage remains on free hugepage list. The allocator ignores + * such a hwpoisoned page so it's never allocated, but it could + * kill a process because of no-memory rather than hwpoison. + * Soft-offline never impacts the userspace, so this is undesired. + */ Thanks, Naoya Horiguchi