On 2021/6/16 8:41, HORIGUCHI NAOYA(堀口 直也) wrote: > On Tue, Jun 15, 2021 at 08:42:23PM +0800, Miaohe Lin wrote: > ... >>> @@ -1960,7 +1964,7 @@ int unpoison_memory(unsigned long pfn) >>> if (!PageHuge(page) && PageTransHuge(page)) { >>> unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n", >>> pfn, &unpoison_rs); >>> - return 0; >>> + goto unlock_mutex; >>> } >>> >> >> Maybe it's more appropriate to start mutex_lock(&mf_mutex) here? I think these races start here. > > Hi Miaohe, > > Thank your for the review. > > Consider that we put mutex_lock() here, and let's think about two concurrent > calls of unpoison_memory(), then these events could be processed like below: > > CPU 0 CPU 1 > unpoison_memory > check PageHWPoison // true > unpoison_memory > check PageHWPoison // true > mutex_lock > get_hwpoison_page > TestClearPageHWPoison > put_page > put_page // freed > mutex_unlock > > // the unpoisoned page can be used for allocation > > mutex_lock > get_hwpoison_page // succeeds > ... // unpoison the !PageHWPoison page !? > > > So I thought that we had better do the prechecks in mf_mutex. Maybe the 2nd > unpoison_memory() just get and put the page refcount by 1 even in this race, > so the impact is not so big, but I feel like avoiding "unpoison the > !PageHWPoison page" situation. > > Does it make sense for you? Yes, I missed the races inside unpoison_memory() itself. Many thanks for your detailed explanation! > > Thanks, > Naoya Horiguchi >