On Thu, Feb 25, 2021 at 10:15:42AM -0800, Luck, Tony wrote:
> On Thu, Feb 25, 2021 at 12:38:06PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Thank you for shedding light on this, this race looks worrisome to me.
> > We call try_to_unmap() inside memory_failure(), where we find affected
> > ptes by page_vma_mapped_walk() and convert them into hwpoison entries in
> > try_to_unmap_one(). So there seem to be two racy cases:
> >
> >   1)
> >      CPU 0                          CPU 1
> >      page_vma_mapped_walk
> >                                     clear _PAGE_PRESENT bit
> >      // skipped the entry
> >
> >   2)
> >      CPU 0                          CPU 1
> >      page_vma_mapped_walk
> >      try_to_unmap_one
> >                                     clear _PAGE_PRESENT bit
> >        convert the entry
> >        set_pte_at
> >
> > In case 1, the affected processes get signals on later access,
> > so although the info in SIGBUS could be different, that's OK.
> > And we have no impact in case 2.
>
> I've been debugging a similar issue for a few days and finally
> got enough traces to partially understand what happened.
>
> The test case is a multi-threaded pointer chasing micro-benchmark
> running on all logical CPUs. We then inject poison into the address
> space of the process.
>
> All works fine if one thread consumes the poison and completes all
> Linux machine check processing before any other threads read the
> poison. The page is unmapped, and a SIGBUS is sent (which kills all
> threads).
>
> But in the problem case I see:

Thanks for the description, it's helpful for understanding the problem.

>
> CPU1 reads poison, takes a machine check. Gets to the
> kill_me_maybe() task work, which calls memory_failure().
> This CPU sets the page poison flag, but is still executing the
> rest of the flow to hunt down tasks/mappings to invalidate pages
> and send SIGBUS if required.
>
> CPU2 reads the poison. When it gets to memory_failure()
> there's an early return because the poison flag is already
> set. So in current code it returns and takes the machine
> check again.
>
> CPU3 reads the poison and starts along the same path that CPU2
> did.

I think that the MCE loop happening on CPU2 and CPU3 is unexpected,
and these threads should immediately kill the current process on each
CPU. force_sig_mceerr() in kill_me_maybe() is supposed to do that, so
Aili's patch would fix this issue too?
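To make sure we're looking at the same path, my mental model of that
early return is roughly the sketch below (abridged from memory, not the
actual code; handle_poisoned_page() is just a placeholder for the real
page-state lookup / unmap / kill steps):

int memory_failure(unsigned long pfn, int flags)
{
	struct page *p = pfn_to_page(pfn);

	if (TestSetPageHWPoison(p)) {
		pr_err("Memory failure: %#lx: already hardware poisoned\n",
		       pfn);
		/*
		 * CPU2/CPU3 return here: the caller sees "success", no
		 * SIGBUS is queued, so the task re-executes the same
		 * load and takes the machine check again.
		 */
		return 0;
	}

	/* CPU1 only: page-state lookup, try_to_unmap(), kill procs, ... */
	return handle_poisoned_page(p, flags);	/* placeholder */
}

If my reading is right, sending force_sig_mceerr() from the caller even
when memory_failure() takes that early-return path is what would break
the loop on CPU2 and CPU3.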
>
> Meanwhile CPU1 gets far enough along in memory_failure() and hits
> a problem. It prints:
>
> [ 1867.409837] Memory failure: 0x42a9ff6: reserved kernel page still referenced by 1 users
> [ 1867.409850] Memory failure: 0x42a9ff6: recovery action for reserved kernel page: Failed
>
> and doesn't complete unmapping the page that CPU2 and CPU3 are touching.
>
> Other CPUs gradually reach the poison and join in the fun of repeatedly
> taking machine checks.
>
> I have not yet tracked down why this user access is reported as a
> "reserved kernel page". Some traces showed that the futex(2) syscall was
> in use by this benchmark, so maybe the kernel locked a user page that
> was a contended futex???

This might imply that the current logic to identify page state does not
work properly on this exotic type of user page. I'll take a look at this
from futex's viewpoint.

>
> Idea for what we should do next ... Now that x86 is calling memory_failure()
> from user context ... maybe parallel calls for the same page should
> be blocked until the first caller completes so we can:
> a) know that pages are unmapped (if that happens)
> b) all get the same success/fail status

One memory_failure() call changes the target page's status and
affects all mappings to all affected processes, so I think that
(ideally) we don't have to block other threads (letting them return
early seems fine). Sometimes memory_failure() fails, but even in
such a case, PG_hwpoison is set on the page and the other threads still
properly get a SIGBUS with this patch, so I think that we can avoid the
worst scenario (like a system stall caused by an MCE loop).

Thanks,
Naoya Horiguchi
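P.S. Just to be concrete about the "block until the first caller
completes" option: I guess it would look roughly like the sketch below
(illustration only, not a patch; memory_failure_body() stands for the
current function body moved under the lock). My point above is that we
may not need this as long as the early-return path delivers a SIGBUS.

/* illustration only: serialize parallel memory_failure() callers */
static DEFINE_MUTEX(mf_mutex);

int memory_failure(unsigned long pfn, int flags)
{
	int ret;

	mutex_lock(&mf_mutex);
	ret = memory_failure_body(pfn, flags);	/* placeholder for the current body */
	mutex_unlock(&mf_mutex);

	return ret;
}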