On 2022/3/8 14:56, HORIGUCHI NAOYA(堀口 直也) wrote: > On Mon, Mar 07, 2022 at 11:07:32AM -0800, Mike Kravetz wrote: > ... >>> >>>> + >>>> + /** >>>> + * The page could have changed compound pages due to race window. >>>> + * If this happens just bail out. >>>> + */ >>>> + if (!PageHuge(p) || compound_head(p) != head) { >>>> + action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED); >>>> + res = -EBUSY; >>>> + goto out; >>>> + } >>> >>> Let me have one comment on the diff. The result code MF_MSG_DIFFERENT_COMPOUND >>> might not fit when PageHuge is false in the check (because it's no longer a >>> compound page). Maybe you may invent another result code, or changes >>> MF_MSG_DIFFERENT_COMPOUND (for example) to MF_MSG_DIFFERENT_PAGE_SIZE? >>> >> >> Suppose we do encounter this race. Also, suppose p != head. >> At the beginning of memory_failure_hugetlb, we do: >> >> struct page *head = compound_head(p); >> ... >> if (TestSetPageHWPoison(head)) >> >> So, it could be that we set Poison in the 'head' page but the error was really >> in another page. Is that correct? >> >> Now with the race, head is not a huge page and the pages could even be on >> buddy. Does this mean we could have poison set on the wrong page in buddy? > > Correct, the race might be rare, but this needs a fix. > I think that setting PageHWPoison first (before taking refcount and page lock) > is the root of all related problems. This behavior came from the original > concept in hwpoison that preventing consumption of corrupted data is the first > priority. But now I think that this makes no sense if we have this kind of bugs. > > I'll try to write a patch for this (I only fix memory_failure_hugetlb() first, > but generic path should be fixed later). > Thank you for pointing out. Many thanks for both of you for doing this. :) > > - Naoya Horiguchi >