Re: Potential race in TLB flush batching?

Mel Gorman <mgorman@xxxxxxx> wrote:

> On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote:
>>> If there are separate address spaces using a shared mapping then the
>>> same race does not occur.
>> 
>> I missed the fact that you reversed the order of the two operations since
>> the previous version of the patch. This specific scenario should be solved
>> by this patch.
>> 
>> But in general, I think there is a need for a simple locking scheme.
> 
> Such as?

Something like:

bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);

which would get the current PTE, the protection bits that the user is
interested in, and whether mmap_sem is taken read/write/none. 

It would return whether this PTE may be potentially stale and needs to be
invalidated. Obviously, any code that removes protection or unmaps pages
would need to be updated for this information to be correct.
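To make the idea concrete, here is a minimal userspace sketch of such a
helper. The types, bit encodings, and the tlb_flush_pending flag are all
assumptions made up for illustration; none of this is existing kernel API:

```c
#include <stdbool.h>

/* Toy stand-ins for the real kernel types. */
typedef unsigned long pte_t;
typedef unsigned long pgprot_t;

#define PTE_PRESENT  0x1UL
#define PTE_WRITE    0x2UL

enum mmap_lock_state { MMAP_LOCK_NONE, MMAP_LOCK_READ, MMAP_LOCK_WRITE };

/* Would be set by the unmap/protection-change paths when a TLB flush is
 * deferred; in a real implementation this state would live per-mm. */
static bool tlb_flush_pending;

/*
 * Return true if the TLB may still cache rights for this PTE beyond
 * 'prot' because a batched flush has not completed.  Holding mmap_sem
 * for write would exclude the batching paths, so the PTE cannot be
 * stale in that case.
 */
bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state)
{
	if (lock_state == MMAP_LOCK_WRITE)
		return false;
	/* If the PTE grants no more than 'prot', staleness is harmless
	 * for this caller. */
	if ((pte & ~prot) == 0)
		return false;
	return tlb_flush_pending;
}
```

A caller such as a write-protect path would then know, without tracing every
batching path by hand, whether it must flush before relying on the PTE.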

[snip]

>> As a result, concurrent operations such as KSM's write_protect_page() or
> 
> write_protect_page operates under the page lock and cannot race with reclaim.

I still do not understand this claim. IIUC, reclaim can unmap the page in
some page table, decide not to reclaim the page, and release the page lock
before the flush.
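The window I mean can be modeled with a toy sequence. This is purely an
illustrative simulation of the interleaving described above; the struct and
function names are made up, not kernel code:

```c
#include <stdbool.h>

/* Toy model: reclaim clears the PTE, defers the TLB flush, and drops
 * the page lock before the flush actually happens. */
struct toy_page {
	bool pte_present;   /* what the page table says             */
	bool tlb_entry;     /* what a CPU's TLB may still cache     */
	bool page_locked;
};

static void reclaim_unmap_deferred_flush(struct toy_page *p)
{
	p->page_locked = true;
	p->pte_present = false;   /* PTE cleared...                  */
	/* ...but the flush is batched, so tlb_entry stays set...    */
	p->page_locked = false;   /* ...and the lock is dropped first. */
}

/* A concurrent path that trusts the page lock (e.g. something like
 * write_protect_page()) can now run while the stale TLB entry still
 * permits access to the page. */
static bool stale_window_open(const struct toy_page *p)
{
	return !p->page_locked && !p->pte_present && p->tlb_entry;
}
```

If this model matches the reclaim path, the page lock alone does not close
the window between the PTE clear and the deferred flush.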

>> page_mkclean_one() can consider the page write-protected while in fact it is
>> still accessible - since the TLB flush was deferred.
> 
> As long as it's flushed before any IO occurs that would lose a data update,
> it's not a data integrity issue.
> 
>> As a result, they may
>> mishandle the PTE without flushing the page. In the case of
>> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
>> that on x86 there are some mitigating factors that would make such an "attack"
>> complicated, but it still seems wrong to me, no?
> 
> I worry that you're beginning to see races everywhere. I admit that the
> rules and protections here are varied and complex but it's worth keeping
> in mind that data integrity is the key concern (no false reads to wrong
> data, no lost writes) and the first race you identified found some problems
> here. However, with or without batching, there is always a delay between
> when a PTE is cleared and when the TLB entries are removed.

Sure, but usually the delay occurs while the page-table lock is held, so
there is no race.

Now, it is not fair to call me paranoid, considering that these races are
real - I confirmed that at least two can happen in practice. There are many
possibilities for concurrent TLB batching, and you cannot expect developers
to consider all of them. I don't think many people can follow the voodoo
tricks of skipping a TLB flush when the page lock is held or the VMA is
anonymous. I doubt that these tricks work, and in any case IMHO they are
likely to break in the future, since they are undocumented and complicated.

As for "data integrity is the key concern" - violating the memory-management
API can cause data integrity issues for programs. It may not crash the OS,
but it should not be acceptable either, and it may raise security concerns.
If you think the current behavior is fine, then the documentation and man
pages should clarify that mprotect may not protect, madvise may not advise,
and so on.

And although you would use it against me, I would say: Nobody knew that TLB
flushing could be so complicated.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .