Mel Gorman <mgorman@xxxxxxx> wrote:

> On Wed, Jul 19, 2017 at 04:39:07PM -0700, Nadav Amit wrote:
>>> If try_to_unmap returns false on CPU0 then at least one unmap attempt
>>> failed and the page is not reclaimed.
>>
>> Actually, try_to_unmap() may even return true, and the page would still
>> not be reclaimed - for example if page_has_private() and freeing the
>> buffers fails. In this case, the page would be unlocked as well.
>
> I'm not seeing the relevance from the perspective of a stale TLB being
> used to corrupt memory or access the wrong data.
>
>>> For those that were unmapped, they will get flushed in the near future.
>>> When KSM operates on CPU1, it'll skip the unmapped pages under the PTL,
>>> so stale TLB entries are not relevant as the mapped entries are still
>>> pointing to a valid page and KSM misses a merge opportunity.
>>
>> This is the case I regarded, but I do not understand your point. The
>> whole problem is that CPU1 would skip the unmapped pages under the PTL.
>> As it skips them, it does not flush them from the TLB. And as a result,
>> replace_page() may happen before the TLB is flushed by CPU0.
>
> At the time of the unlock_page on the reclaim side, any unmapping that
> will happen before the flush has taken place. If KSM starts between the
> unlock_page and the TLB flush, then it'll skip any of the PTEs that were
> previously unmapped with stale entries, so there is no relevant stale TLB
> entry to work with.

I don't see where this skipping happens, but let's put this scenario aside
for a second. Here is a similar scenario that causes memory corruption. I
actually created and tested it (although I needed to hack the kernel to add
some artificial latency before the actual flushes and before the actual
deduplication of KSM).

We are going to cause KSM to deduplicate a page, and after page comparison
but before the page is actually replaced, to use a stale PTE entry to
overwrite the page.
As a result KSM will lose a write, causing memory corruption.

For this race we need 4 CPUs:

CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
      the write later.

CPU1: Runs madvise_free on the range that includes the PTE. It would clear
      the dirty bit. It batches TLB flushes.

CPU2: Writes 4 to /proc/PID/clear_refs, clearing the PTEs' soft-dirty bit.
      We care about the fact that it clears the PTE write bit, and, of
      course, batches TLB flushes.

CPU3: Runs KSM. Our purpose is to pass the following test in
      write_protect_page():

	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

      since then it will avoid a TLB flush. And we want to do it while the
      PTE is stale. Later, and before the page is replaced, we would be
      able to change the page.

Note that all the operations that CPU1-3 perform can happen in parallel
since they only acquire mmap_sem for read.

We start with two identical pages. Everything below regards the same
page/PTE.

CPU0		CPU1		CPU2		CPU3
----		----		----		----
Write the same
value on page

[cache PTE as
 dirty in TLB]

		MADV_FREE
		pte_mkclean()

				4 > clear_refs
				pte_wrprotect()

						write_protect_page()
						[ success, no flush ]

						pages_identical()
						[ ok ]

Write to page
different value

[Ok, using stale
 PTE]

						replace_page()

Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
already wrote on the page, but KSM ignored this write, and it got lost.

Now to reiterate my point: it is really hard to get TLB batching right
without some clear policy, and it is important to get it right, since such
issues can cause memory corruption and have security implications (if
somebody manages to get the timing right).

Regards,
Nadav

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .