Mel Gorman <mgorman@xxxxxxx> wrote:

> On Wed, Jul 19, 2017 at 04:39:07PM -0700, Nadav Amit wrote:
>>> If try_to_unmap returns false on CPU0 then at least one unmap attempt
>>> failed and the page is not reclaimed.
>>
>> Actually, try_to_unmap() may even return true, and the page would still
>> not be reclaimed - for example if page_has_private() and freeing the
>> buffers fails. In this case, the page would be unlocked as well.
>
> I'm not seeing the relevance from the perspective of a stale TLB being
> used to corrupt memory or access the wrong data.
>
>>> For those that were unmapped, they will get flushed in the near future.
>>> When KSM operates on CPU1, it'll skip the unmapped pages under the PTL,
>>> so stale TLB entries are not relevant as the mapped entries are still
>>> pointing to a valid page and KSM misses a merge opportunity.
>>
>> This is the case I regarded, but I do not understand your point. The
>> whole problem is that CPU1 would skip the unmapped pages under the PTL.
>> As it skips them, it does not flush them from the TLB. And as a result,
>> replace_page() may happen before the TLB is flushed by CPU0.
>
> At the time of the unlock_page on the reclaim side, any unmapping that
> will happen before the flush has taken place. If KSM starts between the
> unlock_page and the TLB flush, then it'll skip any of the PTEs that were
> previously unmapped with stale entries, so there is no relevant stale TLB
> entry to work with.

I don't see where this skipping happens, but let's put this scenario aside
for a second. Here is a similar scenario that causes memory corruption. I
actually created and tested it (although I needed to hack the kernel to add
some artificial latency before the actual flushes and before the actual
deduplication of KSM).

We are going to cause KSM to deduplicate a page, and after page comparison
but before the page is actually replaced, to use a stale PTE entry to
overwrite the page.
As a result KSM will lose a write, causing memory corruption.

For this race we need 4 CPUs:

CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
      the write later.

CPU1: Runs madvise_free on the range that includes the PTE. It would clear
      the dirty bit. It batches TLB flushes.

CPU2: Writes 4 to /proc/PID/clear_refs, clearing the PTEs' soft-dirty bit.
      We care about the fact that it clears the PTE write bit, and, of
      course, batches TLB flushes.

CPU3: Runs KSM. Our purpose is to pass the following test in
      write_protect_page():

	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

      since then it will avoid a TLB flush. And we want to do it while the
      PTE is stale. Later, and before the page is replaced, we would be
      able to change the page.

Note that all the operations that CPU1-3 perform can happen in parallel
since they only acquire mmap_sem for read.

We start with two identical pages. Everything below regards the same
page/PTE.

CPU0		CPU1		CPU2		CPU3
----		----		----		----
Write the same
value on page

[cache PTE as
 dirty in TLB]

		MADV_FREE
		pte_mkclean()

				4 > clear_refs
				pte_wrprotect()

						write_protect_page()
						[ success, no flush ]

						pages_identical()
						[ ok ]

Write to page
different value

[Ok, using stale
 PTE]

						replace_page()

Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
already wrote on the page, but KSM ignored this write, and it got lost.

Now to reiterate my point: it is really hard to get TLB batching right
without some clear policy, and it is important to get it right, since such
issues can cause memory corruption and have security implications (if
somebody manages to get the timing right).

Regards,
Nadav

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .