Mel Gorman <mgorman@xxxxxxx> wrote: > On Wed, Jul 19, 2017 at 01:20:01PM -0700, Nadav Amit wrote: >>> From a PTE you cannot know the state of mmap_sem because you can rmap >>> back to multiple mm's for shared mappings. It's also fairly heavy handed. >>> Technically, you could lock on the basis of the VMA but that has other >>> consequences for scalability. The staleness is also a factor because >>> it's a case of "does the staleness matter". Sometimes it does, sometimes >>> it doesn't. mmap_sem even if it could be used does not always tell us >>> the right information either because it can matter whether we are racing >>> against a userspace reference or a kernel operation. >>> >>> It's possible your idea could be made work, but right now I'm not seeing a >>> solution that handles every corner case. I asked to hear what your ideas >>> were because anything I thought of that could batch TLB flushing in the >>> general case had flaws that did not improve over what is already there. >> >> I don???t disagree with what you say - perhaps my scheme is too simplistic. >> But the bottom line, if you cannot form simple rules for when TLB needs to >> be flushed, what are the chances others would get it right? > > Broad rule is "flush before the page is freed/reallocated for clean pages > or any IO is initiated for dirty pages" with a lot of details that are not > documented. Often it's the PTL and flush with it held that protects the > majority of cases but it's not universal as the page lock and mmap_sem > play important rules depending ont the context and AFAIK, that's also > not documented. > >>> shrink_page_list is the caller of try_to_unmap in reclaim context. It >>> has this check >>> >>> if (!trylock_page(page)) >>> goto keep; >>> >>> For pages it cannot lock, they get put back on the LRU and recycled instead >>> of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim >>> can't unmap it. >> >> Yes, of course, since KSM does not batch TLB flushes. I regarded the other >> direction - first try_to_unmap() removes the PTE (but still does not flush), >> unlocks the page, and then KSM acquires the page lock and calls >> write_protect_page(). It finds out the PTE is not present and does not flush >> the TLB. > > When KSM acquires the page lock, it then acquires the PTL where the > cleared PTE is observed directly and skipped. I don’t see why. Let’s try again - CPU0 reclaims while CPU1 deduplicates: CPU0 CPU1 ---- ---- shrink_page_list() => try_to_unmap() ==> try_to_unmap_one() [ unmaps from some page-tables ] [ try_to_unmap returns false; page not reclaimed ] => keep_locked: unlock_page() [ TLB flush deferred ] try_to_merge_one_page() => trylock_page() => write_protect_page() ==> acquire ptl [ PTE non-present —> no PTE change and no flush ] ==> release ptl ==> replace_page() At this point, while replace_page() is running, CPU0 may still not have flushed the TLBs. Another CPU (CPU2) may hold a stale PTE, which is not write-protected. It can therefore write to that page while replace_page() is running, resulting in memory corruption. No? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href