On 8/14/15 4:38 PM, Naoya Horiguchi wrote:
> On Fri, Aug 14, 2015 at 03:59:21PM +0800, Wanpeng Li wrote:
>> On 8/14/15 3:54 PM, Wanpeng Li wrote:
>>> [...]
>>>> OK, then I rethink of handling the race in unpoison_memory().
>>>>
>>>> Currently properly contained/hwpoisoned pages should have page refcount 1
>>>> (when the memory error hits LRU pages or hugetlb pages) or refcount 0
>>>> (when the memory error hits the buddy page.) And current unpoison_memory()
>>>> implicitly assumes this because otherwise the unpoisoned page has no place
>>>> to go and it's just leaked.
>>>> So to avoid the kernel panic, adding prechecks of refcount and mapcount
>>>> to limit the page to unpoison for only unpoisonable pages looks OK to me.
>>>> The page under soft offlining always has refcount >=2 and/or mapcount > 0,
>>>> so such pages should be filtered out.
>>>>
>>>> Here's a patch. In my testing (run soft offline stress testing then repeat
>>>> unpoisoning in background,) the reported (or similar) bug doesn't happen.
>>>> Can I have your comments?
>>> As page_action() prints out page maybe still referenced by some users,
>>> however, PageHWPoison has already set. So you will leak many poison pages.
>>>
>> Anyway, the bug is still there.
>>
>> [ 944.387559] BUG: Bad page state in process expr pfn:591e3
>> [ 944.393053] page:ffffea00016478c0 count:-1 mapcount:0 mapping: (null) index:0x2
>> [ 944.401147] flags: 0x1fffff80000000()
>> [ 944.404819] page dumped because: nonzero _count
> Hmm, no luck :(
>
> To investigate more, I'd like to test the exactly same kernel as yours, so
> could you share the kernel info (.config and base kernel and what patches
> you applied)? or pushing your tree somewhere like github?
> # if you like, sending to me privately is fine.
>
> I think that I tested v4.2-rc6 + <your recent 7 hwpoison patches> +
> "mm/hwpoison: fix race between soft_offline_page and unpoison_memory",
> but I experienced some conflict in applying your patches for some reason,
> so it might happen that we are testing on different kernels.

I don't have a special config or tree. The latest mmotm has already merged
my recent 8 hwpoison patches, so you can test based on it.

Regards,
Wanpeng Li

>
> Mine is here:
> https://github.com/Naoya-Horiguchi/linux v4.2-rc6/fix_race_soft_offline_unpoison
>
> Thanks,
> Naoya Horiguchi
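
For readers following the thread, the refcount/mapcount precheck described in the quoted message above can be modeled in user space roughly as below. This is only a sketch under stated assumptions, not the patch under discussion and not kernel code: `struct page_model` and `can_unpoison()` are hypothetical names, and the fields merely stand in for the page refcount, the mapcount and the PageHWPoison flag. It encodes the rule as stated in the thread: a properly contained hwpoisoned page has refcount 0 or 1 and is not mapped, while a page under soft offlining has refcount >= 2 and/or mapcount > 0 and is filtered out.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the page state discussed in the thread.
 * This is NOT the kernel's struct page; the fields only mirror the values
 * the thread talks about.
 */
struct page_model {
    int  refcount;  /* stands in for the page reference count */
    int  mapcount;  /* stands in for the number of mappings */
    bool hwpoison;  /* stands in for the PageHWPoison flag */
};

/*
 * Return true only if the page looks like a properly contained hwpoisoned
 * page, i.e. one that unpoisoning could hand back without leaking it.
 */
bool can_unpoison(const struct page_model *p)
{
    if (!p->hwpoison)
        return false;   /* nothing to unpoison */

    /* Contained pages have refcount 0 (buddy) or 1 (LRU/hugetlb). */
    if (p->refcount < 0 || p->refcount > 1)
        return false;   /* extra users, or an already corrupted count */

    if (p->mapcount > 0)
        return false;   /* still mapped, e.g. under soft offlining */

    return true;
}
```

Note that a check of this kind, done on its own, only narrows the race window. As the "Bad page state" report quoted above shows, the count can still change between the check and the actual unpoison.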