Mel Gorman <mgorman@xxxxxxx> wrote:

> On Fri, Jul 14, 2017 at 04:16:44PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@xxxxxxx> wrote:
>>
>>> On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote:
>>>>> If reclaim is first, it'll take the PTL, set batched while a racing
>>>>> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
>>>>> immediately calls flush_tlb_batched_pending() before proceeding as normal,
>>>>> finding pte_none with the TLB flushed.
>>>>
>>>> This is the scenario I regarded in my example. Notice that when the first
>>>> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
>>>> locks - allowing them to run concurrently. As a result
>>>> flush_tlb_batched_pending is executed before the PTE was cleared and
>>>> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear
>>>> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.
>>>
>>> If they hold different PTL locks, it means that reclaim and the parallel
>>> munmap/mprotect/madvise/mremap operation are operating on different regions
>>> of an mm or separate mm's and the race should not apply or at the very
>>> least is equivalent to not batching the flushes. For multiple parallel
>>> operations, munmap/mprotect/mremap are serialised by mmap_sem so there
>>> is only one risky operation at a time. For multiple madvise, there is a
>>> small window when a page is accessible after madvise returns but it is an
>>> advisory call so it's primarily a data integrity concern and the TLB is
>>> flushed before the page is either freed or IO starts on the reclaim side.
>>
>> I think there is some miscommunication. Perhaps one detail was missing:
>>
>>     CPU0                            CPU1
>>     ----                            ----
>>     should_defer_flush
>>     => mm->tlb_flush_batched=true
>>                                     flush_tlb_batched_pending (another PT)
>>                                     => flush TLB
>>                                     => mm->tlb_flush_batched=false
>>
>>                                     Access PTE (and cache in TLB)
>>     ptep_get_and_clear(PTE)
>>     ...
>>
>>                                     flush_tlb_batched_pending (batched PT)
>>                                     [ no flush since tlb_flush_batched=false ]
>>                                     use the stale PTE
>>     ...
>>     try_to_unmap_flush
>>
>> There are only 2 CPUs and both regard the same address-space. CPU0 reclaims a
>> page from this address-space. Just between setting tlb_flush_batched and the
>> actual clearing of the PTE, the process on CPU1 runs munmap and calls
>> flush_tlb_batched_pending. This can happen if CPU1 regards a different
>> page-table.
>
> If both regard the same address-space then they have the same page table so
> there is a disconnect between the first and last sentence in your paragraph
> above. On CPU 0, the setting of tlb_flush_batched and ptep_get_and_clear
> is also reversed as the sequence is
>
>	pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>	set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>
> Additional barriers should not be needed as within the critical section
> that can race, it's protected by the lock and with Andy's code, there is
> a full barrier before the setting of tlb_flush_batched. With Andy's code,
> there may be a need for a compiler barrier but I can rethink about that
> and add it during the backport to -stable if necessary.
>
> So the setting happens after the clear and if they share the same address
> space and collide then they both share the same PTL so are protected from
> each other.
>
> If there are separate address spaces using a shared mapping then the
> same race does not occur.

I missed the fact that you reversed the order of the two operations since the
previous version of the patch. This specific scenario should be solved with
this patch.

But in general, I think there is a need for a simple locking scheme.
Otherwise, people (like me) would be afraid to make any changes to the code,
and further missing TLB flushes are likely to exist. For example, I suspect
that a user may trigger insert_pfn() or insert_page(), and rely on their output.
While it makes little sense, the user can try to insert the page at the same
address as another page. If the other page was already reclaimed the operation
should succeed, and otherwise it should fail. But it may succeed while the
other page is going through reclamation, resulting in:

    CPU0                            CPU1
    ----                            ----
    ptep_clear_flush_notify()
    - access memory using a PTE
    [ PTE cached in TLB ]
                                    try_to_unmap_one()
                                    ==> ptep_get_and_clear() == false
    insert_page()
    ==> pte_none() = true
        [retval = 0]

    - access memory using a stale PTE

IIUC, similar situations can be caused by mcopy_atomic_pte(),
mfill_zeropage_pte() and shmem_mcopy_atomic_pte().

Even more importantly, I suspect there is an additional similar but
unrelated problem. clear_refs_write() can be used with
CLEAR_REFS_SOFT_DIRTY to write-protect PTEs. However, it batches TLB flushes
while only holding mmap_sem for read, and without any indication in mm that
TLB flushes are pending.

As a result, concurrent operations such as KSM’s write_protect_page() or
page_mkclean_one() can consider the page write-protected while in fact it is
still accessible - since the TLB flush was deferred. Consequently, they may
mishandle the PTE without flushing the page. In the case of
page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
that on x86 there are some mitigating factors that would make such an “attack”
complicated, but it still seems wrong to me, no?

Thanks,
Nadav
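
For anyone following the thread, the first interleaving discussed above can be
replayed with a toy, single-threaded userspace model. This is only a sketch
under stated assumptions, not kernel code: struct toy_mm, toy_access(),
toy_flush_tlb_batched_pending() and toy_replay() are hypothetical names made up
for illustration, the "TLB" is a single boolean, and the interleaving is
hard-coded rather than produced by real concurrency. It only shows why the
order of clearing the PTE and setting tlb_flush_batched matters when CPU1 runs
under a different page-table lock in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state (hypothetical, not a kernel structure): one PTE, one TLB entry. */
struct toy_mm {
	bool pte_present;       /* the PTE that reclaim is about to clear */
	bool cpu1_tlb_cached;   /* CPU1 holds a cached translation for it */
	bool tlb_flush_batched; /* models mm->tlb_flush_batched */
};

enum reclaim_order {
	FLAG_THEN_CLEAR, /* order assumed in the race diagram */
	CLEAR_THEN_FLAG, /* order of the two calls quoted from the patch */
};

/* Models flush_tlb_batched_pending(): flush only if a flush is marked pending. */
void toy_flush_tlb_batched_pending(struct toy_mm *mm)
{
	if (mm->tlb_flush_batched) {
		mm->cpu1_tlb_cached = false;
		mm->tlb_flush_batched = false;
	}
}

/* Models CPU1 touching the page: a present PTE gets cached in the TLB. */
void toy_access(struct toy_mm *mm)
{
	if (mm->pte_present)
		mm->cpu1_tlb_cached = true;
}

/* Returns true if CPU1 ends up with a cached translation for a cleared PTE. */
bool toy_replay(enum reclaim_order order)
{
	struct toy_mm mm = { .pte_present = true };

	assert(order == FLAG_THEN_CLEAR || order == CLEAR_THEN_FLAG);

	/* CPU0, holding the PTL of the batched page table: first step. */
	if (order == FLAG_THEN_CLEAR)
		mm.tlb_flush_batched = true;
	else
		mm.pte_present = false;

	/* CPU1, holding a different PTL, runs in between CPU0's two steps. */
	toy_flush_tlb_batched_pending(&mm);
	toy_access(&mm);

	/* CPU0: second step, after which it drops the PTL. */
	if (order == FLAG_THEN_CLEAR)
		mm.pte_present = false;
	else
		mm.tlb_flush_batched = true;

	/* CPU1 now takes the PTL of the batched page table. */
	toy_flush_tlb_batched_pending(&mm);

	return mm.cpu1_tlb_cached && !mm.pte_present;
}
```

In this model, toy_replay(FLAG_THEN_CLEAR) leaves CPU1 with a stale entry,
while toy_replay(CLEAR_THEN_FLAG) does not. It says nothing about the
insert_page() or clear_refs_write() cases, which involve paths the model does
not cover.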