On Tue, Apr 13, 2021 at 07:43:18AM +0900, Naoya Horiguchi wrote: > From: Tony Luck <tony.luck@xxxxxxxxx> > > There can be races when multiple CPUs consume poison from the same > page. The first into memory_failure() atomically sets the HWPoison > page flag and begins hunting for tasks that map this page. Eventually > it invalidates those mappings and may send a SIGBUS to the affected > tasks. > > But while all that work is going on, other CPUs see a "success" > return code from memory_failure() and so they believe the error > has been handled and continue executing. > > Fix by wrapping most of the internal parts of memory_failure() in > a mutex. > > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> > --- > mm/memory-failure.c | 19 +++++++++++++++++-- > 1 file changed, 17 insertions(+), 2 deletions(-) > > diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c > index 24210c9bd843..c1509f4b565e 100644 > --- v5.12-rc5/mm/memory-failure.c > +++ v5.12-rc5_patched/mm/memory-failure.c > @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, > return rc; > } > > +static DEFINE_MUTEX(mf_mutex); > + > /** > * memory_failure - Handle memory failure of a page. > * @pfn: Page Number of the corrupted page > @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) > return -ENXIO; > } So the locking patterns are done in two different ways, which are confusing when following this code: > + mutex_lock(&mf_mutex); > + > try_again: > - if (PageHuge(p)) > - return memory_failure_hugetlb(pfn, flags); > + if (PageHuge(p)) { > + res = memory_failure_hugetlb(pfn, flags); > + goto out2; > + } You have the goto to a label where you do the unlocking (btw, pls do s/out2/out_unlock/g;)... > + > if (TestSetPageHWPoison(p)) { > pr_err("Memory failure: %#lx: already hardware poisoned\n", > pfn); > + mutex_unlock(&mf_mutex); > return 0; ... and you have the other case where you unlock before returning. Since you've added the label, I think *all* the unlocking should do "goto out_unlock" instead of doing either/or. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette