On Mon, Aug 26, 2019 at 12:41:50PM +0200, Oscar Salvador wrote: > Hi, > > When analyzing a problem reported by one of our customers, I stumbbled upon an issue > that origins from the fact that poisoned pages end up in the buddy allocator. > > Let me break down the stepts that lie to the problem: > > 1) We soft-offline a page > 2) Page gets flagged as HWPoison and is being sent to the buddy allocator. > This is done through set_hwpoison_free_buddy_page(). > 3) Kcompactd wakes up in order to perform some compaction. > 4) compact_zone() will call migrate_pages() > 5) migrate_pages() will try to get a new page from compaction_alloc() to migrate to > 6) if cc->freelist is empty, compaction_alloc() will call isolate_free_pagesblock() > 7) isolate_free_pagesblock only checks for PageBuddy() to assume that a page is OK > to be used to migrate to. Since HWPoisoned page are also PageBuddy, we add > the page to the list. (same problem exists in fast_isolate_freepages()). > > The outcome of that is that we end up happily handing poisoned pages in compaction_alloc, > so if we ever got a fault on that page through *_fault, we will return VM_FAULT_HWPOISON, > and the process will be killed. > > I first though that I could get away with it by checking PageHWPoison in > {fast_isolate_freepages/isolate_free_pagesblock}, but not really. > It might be that the page we are checking is an order > 0 page, so the first page > might not be poisoned, but the one the follows might be, and we end up in the > same situation. Yes, this is a whole point of the current implementation. > > After some more thought, I really came to the conclusion that HWPoison pages should not > really be in the buddy allocator, as this is only asking for problems. > In this case it is only compaction code, but it could be happening somewhere else, > and one would expect that the pages you got from the buddy allocator are __ready__ to use. > > I __think__ that we thought we were safe to put HWPoison pages in the buddy allocator as we > perform healthy checks when getting a page from there, so we skip poisoned pages > > Of course, this is not the end of the story, now that someone got a page, if he frees it, > there is a high chance that this page ends up in a pcplist (I saw that). > Unless we are on CONFIG_VM_DEBUG, we do not check for the health of pages got from pcplist, > as we do when getting a page from the buddy allocator. > > I checked [1], and it seems that [2] was going towards fixing this kind of issue. > > I think it is about time to revamp the whole thing. > > @Naoya: I could give it a try if you are busy. Thanks for raising hand. That's really wonderful. I think that the series [1] is not merge yet but not rejected yet, so feel free to reuse/update/revamp it. > > [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@xxxxxxxxxxxxx/ > [2] https://lore.kernel.org/linux-mm/1541746035-13408-9-git-send-email-n-horiguchi@xxxxxxxxxxxxx/ > > -- > Oscar Salvador > SUSE L3 >