Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Fri, 5 Apr 2019 09:24:01 -0600

> On Apr 4, 2019, at 4:55 PM, Khalid Aziz <khalid.aziz@xxxxxxxxxx> wrote:
> 
>> On 4/3/19 10:10 PM, Andy Lutomirski wrote:
>>> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@xxxxxxxxxx> wrote:
>>> 
>>> XPFO flushes kernel space TLB entries for pages that are now mapped
>>> in userspace on not only the current CPU but also all other CPUs
>>> synchronously. Processes on each core allocating pages causes a
>>> flood of IPI messages to all other cores to flush TLB entries.
>>> Many of these messages are to flush the entire TLB on the core if
>>> the number of entries being flushed from local core exceeds
>>> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
>>> unmapping pages from physmap goes up dramatically on machines with
>>> high core count.
>>> 
>>> This patch flushes relevant TLB entries for current process or
>>> entire TLB depending upon number of entries for the current CPU
>>> and posts a pending TLB flush on all other CPUs when a page is
>>> unmapped from kernel space and mapped in userspace. Each core
>>> checks the pending TLB flush flag for itself on every context
>>> switch, flushes its TLB if the flag is set and clears it.
>>> This patch potentially aggregates multiple TLB flushes into one.
>>> This has very significant impact especially on machines with large
>>> core counts.
>> 
>> Why is this a reasonable strategy?
> 
> Ideally when pages are unmapped from physmap, all CPUs would be sent IPI
> synchronously to flush TLB entry for those pages immediately. This may
> be ideal from correctness and consistency point of view, but it also
> results in IPI storm and repeated TLB flushes on all processors. Any
> time a page is allocated to userspace, we are going to go through this
> and it is very expensive. On a 96-core server, performance degradation
> is 26x!!

Indeed. XPFO is expensive.

> 
> When xpfo unmaps a page from physmap only (after mapping the page in
> userspace in response to an allocation request from userspace) on one
> processor, there is a small window of opportunity for ret2dir attack on
> other cpus until the TLB entry in physmap for the unmapped pages on
> other cpus is cleared.

Why do you think this window is small? Intervals of seconds to months between context switches aren’t unheard of.

And why is a small window like this even helpful?  For a ret2dir attack, you just need to get CPU A to allocate a page and write the ret2dir payload and then get CPU B to return to it before context switching.  This should be doable quite reliably.
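
To make the window concrete, here is a minimal user-space sketch of the deferred-flush scheme as the patch description above states it: flush locally, post a pending flush everywhere else, and only act on it at the next context switch. This is a toy model, not kernel code. Every name in it (sim_cpu, sim_xpfo_unmap, sim_context_switch, sim_window_open, SIM_NCPUS) is made up for illustration, and it assumes the worst case where every CPU already has the physmap alias cached in its TLB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_NCPUS 4  /* hypothetical CPU count for the model */

/* Toy model, not kernel code: one physmap translation per CPU. */
struct sim_cpu {
    bool tlb_has_physmap_entry; /* stale physmap alias still cached? */
    bool flush_pending;         /* deferred flush posted by another CPU */
};

static struct sim_cpu sim_cpus[SIM_NCPUS];

/* Worst case: every CPU starts with the physmap alias in its TLB. */
static void sim_reset(void)
{
    for (size_t i = 0; i < SIM_NCPUS; i++) {
        sim_cpus[i].tlb_has_physmap_entry = true;
        sim_cpus[i].flush_pending = false;
    }
}

/* Page handed to userspace on 'cpu': flush locally, defer elsewhere. */
static void sim_xpfo_unmap(size_t cpu)
{
    for (size_t i = 0; i < SIM_NCPUS; i++) {
        if (i == cpu)
            sim_cpus[i].tlb_has_physmap_entry = false;
        else
            sim_cpus[i].flush_pending = true;
    }
}

/* The deferred flush runs only when 'cpu' context switches. */
static void sim_context_switch(size_t cpu)
{
    if (sim_cpus[cpu].flush_pending) {
        sim_cpus[cpu].tlb_has_physmap_entry = false;
        sim_cpus[cpu].flush_pending = false;
    }
}

/* True while 'cpu' can still reach the page through the physmap alias. */
static bool sim_window_open(size_t cpu)
{
    return sim_cpus[cpu].tlb_has_physmap_entry;
}
```

In this model, after sim_xpfo_unmap(0) the window is closed on CPU 0 but stays open on every other CPU until that CPU calls sim_context_switch(). Nothing bounds how long that takes, which is the problem.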

So I don’t really have a suggestion, but a 44% regression for a weak defense like this doesn’t seem worthwhile.  I bet that any of a number of CFI techniques (RAP-like or otherwise) will be cheaper and protect against ret2dir better.  And they’ll also protect against using other kernel memory as a stack buffer.  There are plenty of those: think pipe buffers, network buffers, any page cache not covered by XPFO, XMM/YMM saved state, etc.