Re: Potential race in TLB flush batching?

Nadav Amit <nadav.amit@xxxxxxxxx> · Tue, 11 Jul 2017 00:30:28 -0700

Mel Gorman <mgorman@xxxxxxx> wrote:

> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>> Something bothers me about the TLB flushes batching mechanism that Linux
>> uses on x86 and I would appreciate your opinion regarding it.
>> 
>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>> however, the page-table lock(s) are not held, and I see no indication of the
>> pending flush saved (and regarded) in the relevant mm-structs.
>> 
>> So, my question: what prevents, at least in theory, the following scenario:
>> 
>> 	CPU0 				CPU1
>> 	----				----
>> 					user accesses memory using RW PTE 
>> 					[PTE now cached in TLB]
>> 	try_to_unmap_one()
>> 	==> ptep_get_and_clear()
>> 	==> set_tlb_ubc_flush_pending()
>> 					mprotect(addr, PROT_READ)
>> 					==> change_pte_range()
>> 					==> [ PTE non-present - no flush ]
>> 
>> 					user writes using cached RW PTE
>> 	...
>> 
>> 	try_to_unmap_flush()
>> 
>> 
>> As you see, CPU1's write should have failed, but may succeed.
>> 
>> Now I don't have a PoC since in practice it seems hard to create such a
>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>> would not be reclaimed.
> 
> That is the same as a race whereby there is no batching mechanism and the
> racing operation happens between a pte clear and a flush as ptep_clear_flush
> is not atomic. All that differs is that the race window is a different size.
> The application on CPU1 is buggy in that it may or may not succeed the write
> but it is buggy regardless of whether a batching mechanism is used or not.

Thanks for your quick and detailed response, but I fail to see how it can
happen without batching. Indeed, the PTE clear and flush are not “atomic”,
but without batching they are both performed under the page table lock
(which is acquired in page_vma_mapped_walk and released in
page_vma_mapped_walk_done). Since the lock is taken, other cores should not
be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
and change_pte_range, acquire the lock before accessing the PTEs.

Can you please explain why you consider the application to be buggy? AFAIU
an application can wish to trap certain memory accesses using userfaultfd or
SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
do so, it can use mprotect with PROT_NONE and expect to be able to trap
future accesses to that memory. This use-case is described in the userfaultfd
documentation.
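
As a rough illustration of the use-case I have in mind, here is a minimal
user-space sketch, assuming Linux and a SIGSEGV handler rather than
userfaultfd. It mirrors the mprotect(addr, PROT_READ) step of the scenario
above: the page is written once, write-protected, and then written again,
and the application relies on that second write being trapped. The names
count_trapped_writes() and on_segv() are made up for this example. It only
shows the expectation; it does not reproduce the race.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t trapped_writes; /* faults seen by the handler */
static char *page;                           /* the single tracked page */
static size_t page_size;

/* Hypothetical handler: count the fault, then restore write access so the
 * faulting instruction can be retried when the handler returns. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    char *addr = (char *)info->si_addr;

    (void)sig;
    (void)ctx;
    if (addr < page || addr >= page + page_size)
        _exit(2); /* unrelated fault: give up */
    trapped_writes++;
    /* mprotect() is not on the POSIX async-signal-safe list; calling it
     * here is a common practice on Linux, assumed acceptable for a demo. */
    if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(3);
}

/* Returns the number of trapped writes (expected: 1), or -1 on setup error. */
int count_trapped_writes(void)
{
    struct sigaction sa;
    int n;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    trapped_writes = 0;
    page[0] = 1; /* first write: installs a RW PTE, likely cached in the TLB */

    if (mprotect(page, page_size, PROT_READ) != 0)
        return -1;

    /* The application expects this write to fault and reach on_segv(). */
    ((volatile char *)page)[0] = 2;
    assert(page[0] == 2); /* the retried write landed after the handler ran */

    n = (int)trapped_writes;
    munmap(page, page_size);
    return n;
}
```

On a kernel that behaves as the application expects, count_trapped_writes()
returns 1. In the scenario I described, the concern is that the second write
could go through a stale RW TLB entry and the handler would never run.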

> The user accessed the PTE before the mprotect so, at the time of mprotect,
> the PTE is either clean or dirty. If it is clean then any subsequent write
> would transition the PTE from clean to dirty and an architecture enabling
> the batching mechanism must trap a clean->dirty transition for unmapped
> entries as commented upon in try_to_unmap_one (and was checked that this
> is true for x86 at least). This avoids data corruption due to a lost update.
> 
> If the previous access was a write then the batching flushes the page if
> any IO is required to avoid any writes after the IO has been initiated
> using try_to_unmap_flush_dirty so again there is no data corruption. There
> is a window where the TLB entry exists after the unmapping but this exists
> regardless of whether we batch or not.
> 
> In either case, before a page is freed and potentially allocated to another
> process, the TLB is flushed.

To clarify my concern again: I am not concerned with memory corruption, as
you are, but with situations in which the application wishes to trap certain
memory accesses and fails to do so. Having said that, I would add that even
if an application has a bug, it may expect this bug not to affect memory that
was previously unmapped (and may be written to permanent storage).

Thanks (again),
Nadav
