On Tue, Jun 15, 2021 at 08:42:23PM +0800, Miaohe Lin wrote: ... > > @@ -1960,7 +1964,7 @@ int unpoison_memory(unsigned long pfn) > > if (!PageHuge(page) && PageTransHuge(page)) { > > unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n", > > pfn, &unpoison_rs); > > - return 0; > > + goto unlock_mutex; > > } > > > > Maybe it's more appropriate to start mutex_lock(&mf_mutex) here? I think these races start here. Hi Miaohe, Thank your for the review. Consider that we put mutex_lock() here, and let's think about two concurrent calls of unpoison_memory(), then these events could be processed like below: CPU 0 CPU 1 unpoison_memory check PageHWPoison // true unpoison_memory check PageHWPoison // true mutex_lock get_hwpoison_page TestClearPageHWPoison put_page put_page // freed mutex_unlock // the unpoisoned page can be used for allocation mutex_lock get_hwpoison_page // succeeds ... // unpoison the !PageHWPoison page !? So I thought that we had better do the prechecks in mf_mutex. Maybe the 2nd unpoison_memory() just get and put the page refcount by 1 even in this race, so the impact is not so big, but I feel like avoiding "unpoison the !PageHWPoison page" situation. Does it make sense for you? Thanks, Naoya Horiguchi