On Tue, Oct 26, 2021 at 09:00:37PM -0700, Yang Shi wrote: > On Mon, Oct 25, 2021 at 4:16 PM Naoya Horiguchi <nao.horiguchi@xxxxxxxxx> wrote: > > > > From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> > > > > After recent soft-offline rework, error pages can be taken off from > > buddy allocator, but the existing unpoison_memory() does not properly > > undo the operation. Moreover, due to the recent change on > > __get_hwpoison_page(), get_page_unless_zero() is hardly called for > > hwpoisoned pages. So __get_hwpoison_page() highly likely returns zero > > (meaning to fail to grab page refcount) and unpoison just clears > > PG_hwpoison without releasing a refcount. That does not lead to a > > critical issue like kernel panic, but unpoisoned pages never get back to > > buddy (leaked permanently), which is not good. > > > > To (partially) fix this, we need to identify "taken off" pages from > > other types of hwpoisoned pages. We can't use refcount or page flags > > for this purpose, so a pseudo flag is defined by hacking ->private > > field. Someone might think that put_page() is enough to cancel > > taken-off pages, but the normal free path contains some operations not > > suitable for the current purpose, and can fire VM_BUG_ON(). > > > > Note that unpoison_memory() is now supposed to be cancel hwpoison events > > injected only by madvise() or /sys/devices/system/memory/{hard,soft}_offline_page, > > not by MCE injection, so please don't try to use unpoison when testing > > with MCE injection. > > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> > > --- > > ChangeLog v2: > > - unpoison_memory() returns as commented > > - explicitly avoids unpoisoning slab pages > > - separates internal pinning function into __get_unpoison_page() > > --- ... > > @@ -1262,6 +1278,27 @@ static int get_any_page(struct page *p, unsigned long flags) > > return ret; > > } > > > > +static int __get_unpoison_page(struct page *page) > > +{ > > + struct page *head = compound_head(page); > > + int ret = 0; > > + bool hugetlb = false; > > + > > + ret = get_hwpoison_huge_page(head, &hugetlb); > > + if (hugetlb) > > + return ret; > > + > > + /* > > + * PageHWPoisonTakenOff pages are not only marked as PG_hwpoison, > > + * but also isolated from buddy freelist, so need to identify the > > + * state and have to cancel both operations to unpoison. > > + */ > > + if (PageHWPoisonTakenOff(head)) > > + return -EHWPOISON; > > I don't think we could see compound page here, but checking head page > might be confusing since private is per subpage, so it might be better > to use page instead of head. OK, I'll do this (and pass page to get_page_unless_zero() too). > > + > > + return get_page_unless_zero(head) ? 1 : 0; > > +} > > + > > /** > > * get_hwpoison_page() - Get refcount for memory error handling > > * @p: Raw error page (hit by memory error) > > @@ -1278,18 +1315,26 @@ static int get_any_page(struct page *p, unsigned long flags) > > * extra care for the error page's state (as done in __get_hwpoison_page()), > > * and has some retry logic in get_any_page(). > > * > > + * When called from unpoison_memory(), the caller should already ensure that > > + * the given page has PG_hwpoison. So it's never reused for other page > > + * allocations, and __get_unpoison_page() never races with them. > > + * > > * Return: 0 on failure, > > * 1 on success for in-use pages in a well-defined state, > > * -EIO for pages on which we can not handle memory errors, > > * -EBUSY when get_hwpoison_page() has raced with page lifecycle > > - * operations like allocation and free. > > + * operations like allocation and free, > > + * -EHWPOISON when the page is hwpoisoned and taken off from buddy. > > */ > > static int get_hwpoison_page(struct page *p, unsigned long flags) > > { > > int ret; > > > > zone_pcp_disable(page_zone(p)); > > - ret = get_any_page(p, flags); > > + if (flags & MF_UNPOISON) > > + ret = __get_unpoison_page(p); > > + else > > + ret = get_any_page(p, flags); > > zone_pcp_enable(page_zone(p)); > > > > return ret; > > @@ -1942,6 +1987,26 @@ core_initcall(memory_failure_init); > > pr_info(fmt, pfn); \ > > }) > > > > +static inline int clear_page_hwpoison(struct ratelimit_state *rs, struct page *p) > > +{ > > + if (TestClearPageHWPoison(p)) { > > + unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n", > > + page_to_pfn(p), rs); > > + num_poisoned_pages_dec(); > > + return 0; > > + } > > + return -EBUSY; > > I don't quite get why -EBUSY is returned. TestClear returns the old > value, so returning 0 means the flag was cleared before. It is fine, > right? I don't see why we have to return different values. I think clear_page_hwpoison()'s return value is used in the path where get_hwpoison_page(MF_UNPOISON) returns 1. But as you mentioned -EBUSY might not be a good return value. And I noticed that freeit's semantics is wrongly reversed, that's just a bug, so I'll fix it. > > > +} > > + > > +static inline int unpoison_taken_off_page(struct ratelimit_state *rs, > > + struct page *p) > > +{ > > + if (take_page_back_buddy(p) && !clear_page_hwpoison(rs, p)) > > If clear_page_hwpoison() is void, it can be moved into take_page_back_buddy(). > > > + return 0; > > + else > > + return -EBUSY; > > +} > > + > > /** > > * unpoison_memory - Unpoison a previously poisoned page > > * @pfn: Page number of the to be unpoisoned page > > @@ -1958,9 +2023,7 @@ int unpoison_memory(unsigned long pfn) > > { > > struct page *page; > > struct page *p; > > - int freeit = 0; > > - int ret = 0; > > - unsigned long flags = 0; > > + int ret = -EBUSY; > > static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL, > > DEFAULT_RATELIMIT_BURST); > > > > @@ -1996,24 +2059,27 @@ int unpoison_memory(unsigned long pfn) > > goto unlock_mutex; > > } > > > > - if (!get_hwpoison_page(p, flags)) { > > - if (TestClearPageHWPoison(p)) > > - num_poisoned_pages_dec(); > > - unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n", > > - pfn, &unpoison_rs); > > + if (PageSlab(page)) > > goto unlock_mutex; > > - } > > > > - if (TestClearPageHWPoison(page)) { > > - unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n", > > - pfn, &unpoison_rs); > > - num_poisoned_pages_dec(); > > - freeit = 1; > > - } > > + ret = get_hwpoison_page(p, MF_UNPOISON); > > + if (!ret) { > > + ret = clear_page_hwpoison(&unpoison_rs, p); > > + } else if (ret < 0) { > > + if (ret == -EHWPOISON) { > > + ret = unpoison_taken_off_page(&unpoison_rs, p); > > + } else > > + unpoison_pr_info("Unpoison: failed to grab page %#lx\n", > > + pfn, &unpoison_rs); > > + } else { > > + int freeit = clear_page_hwpoison(&unpoison_rs, p); > > > > - put_page(page); > > - if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1)) > > put_page(page); > > + if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1)) { > > + put_page(page); > > + ret = 0; > > + } > > + } > > > > unlock_mutex: > > mutex_unlock(&mf_mutex); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 4ea590646f89..b6e4cbb44c54 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -9466,6 +9466,7 @@ bool take_page_off_buddy(struct page *page) > > del_page_from_free_list(page_head, zone, page_order); > > break_down_buddy_pages(zone, page_head, page, 0, > > page_order, migratetype); > > + SetPageHWPoisonTakenOff(page); > > if (!is_migrate_isolate(migratetype)) > > __mod_zone_freepage_state(zone, -1, migratetype); > > ret = true; > > @@ -9477,4 +9478,26 @@ bool take_page_off_buddy(struct page *page) > > spin_unlock_irqrestore(&zone->lock, flags); > > return ret; > > } > > + > > +/* > > + * Cancel takeoff done by take_page_off_buddy(). > > + */ > > +bool take_page_back_buddy(struct page *page) > > put_page_back_buddy() sounds more natural? Thanks for the suggestion, sounds nice, so I'll rename it. Thanks, Naoya Horiguchi