Re: TLB flushes on fixmap changes

Nadav Amit <nadav.amit@xxxxxxxxx> · Sun, 26 Aug 2018 20:26:09 -0700

at 8:03 PM, Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:

> On Sun, 26 Aug 2018 11:09:58 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> 
>> On Sat, Aug 25, 2018 at 09:21:22PM -0700, Andy Lutomirski wrote:
>>> I just re-read text_poke().  It's, um, horrible.  Not only is the
>>> implementation overcomplicated and probably buggy, but it's SLOOOOOW.
>>> It's totally the wrong API -- poking one instruction at a time
>>> basically can't be efficient on x86.  The API should either poke lots
>>> of instructions at once or should be text_poke_begin(); ...;
>>> text_poke_end();.
>> 
>> I don't think anybody ever cared about performance here. Only
>> correctness. That whole text_poke_bp() thing is entirely tricky.
> 
> Agreed. Self modification is a special event.
> 
>> FWIW, before text_poke_bp(), text_poke() would only be used from
>> stop_machine, so all the other CPUs would be stuck busy-waiting with
>> IRQs disabled. These days, yeah, that's lots more dodgy, but yes
>> text_mutex should be serializing all that.
> 
> I'm still not sure that speculative page-table walk can be done
> over the mutex. Also, if the fixmap area is for aliasing
> pages (which always mapped to memory), what kind of
> security issue can happen?

The PTE is accessible from other cores, so just as we assume for L1TF that
any addressable memory might be cached in L1, we should assume that any
PTE might be cached in the TLB while it is present.

Although the mapping is for an alias, there are a couple of issues here.
First, this alias mapping is writable, so it might allow an attacker to
change the kernel code (following another initial attack). Second, the alias
mapping is never explicitly flushed. We may assume that once the original
mapping is removed/changed, a full TLB flush would take place, but there is
no guarantee it actually takes place.
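
To make the alias concern concrete, here is a rough userspace analogy, not
kernel code. It assumes Linux/POSIX, and poke_via_alias() is a name made up
for illustration. The "text" is a read-only mapping of a page, and the poke
goes through a second, writable mapping of the same page. In userspace
munmap() takes care of invalidating the alias; the fixmap path has to do
the equivalent itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one byte through a writable alias of a page
 * that is otherwise mapped read-only. Returns the byte as seen through the
 * read-only mapping afterwards, or -1 on error.
 */
int poke_via_alias(unsigned char new_byte)
{
    char path[] = "/tmp/alias-demo-XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *text, *alias;
    int fd, seen;

    assert(page > 0);

    /* One page of backing memory, shared by both mappings. */
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) < 0)
        goto err_close;

    /* The "text" view: read-only, like the normal kernel text mapping. */
    text = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (text == MAP_FAILED)
        goto err_close;

    /* The alias: a second, writable view of the very same page. */
    alias = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (alias == MAP_FAILED)
        goto err_unmap;

    alias[0] = new_byte;

    /* Tear the alias down; as long as it exists, the page is writable. */
    munmap(alias, (size_t)page);

    seen = text[0];
    munmap(text, (size_t)page);
    close(fd);
    return seen;

err_unmap:
    munmap(text, (size_t)page);
err_close:
    close(fd);
    return -1;
}
```

Calling poke_via_alias(0xCC) should return 0xCC: the read-only view sees
the change even though it was never writable itself.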

> Anyway, from the viewpoint of kprobes, either per-cpu fixmap or
> changing CR3 sounds good to me. I think we don't even need per-cpu,
> it can call a thread/function on a dedicated core (like the first
> boot processor) and wait :) This may prevent leakage of pte change
> to other cores.

I implemented per-cpu fixmap, but I think that it makes more sense to take
peterz's approach and set an entry at the PGD level. Per-CPU fixmap requires
either pre-populating various levels in the page-table hierarchy, or
conditionally synchronizing whenever module memory is allocated, since they
can share the same PGD, PUD & PMD. While usually the synchronization is not
needed, the possibility that synchronization is needed complicates locking.

Anyhow, having fixed addresses for the fixmap can be used to circumvent
KASLR.

I don’t think a dedicated core is needed. Anyhow there is a lock
(text_mutex), so use_mm() can be used after acquiring the mutex.
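
The ordering I have in mind can be modeled in userspace roughly as below.
This is only a sketch: it is not use_mm(), and every name in it
(poke_session, poke_begin(), poke(), poke_end(), make_backing()) is
hypothetical. A pthread mutex stands in for text_mutex and a temporary
writable mmap() stands in for the private mapping. The lock is taken first,
the writable view exists only while the lock is held, and it is torn down
before the lock is released.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Stand-in for text_mutex: serializes all pokers. */
static pthread_mutex_t text_lock = PTHREAD_MUTEX_INITIALIZER;

struct poke_session {
    unsigned char *alias;   /* temporary writable view, valid under the lock */
    size_t len;
};

/* Hypothetical helper: unlinked temp file of len bytes acting as "text". */
int make_backing(size_t len)
{
    char path[] = "/tmp/poke-demo-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Take the lock, then create the writable view. */
int poke_begin(struct poke_session *s, int fd, size_t len)
{
    pthread_mutex_lock(&text_lock);
    s->alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (s->alias == MAP_FAILED) {
        s->alias = NULL;
        pthread_mutex_unlock(&text_lock);
        return -1;
    }
    s->len = len;
    return 0;
}

/* Any number of writes can be batched between begin and end. */
int poke(struct poke_session *s, size_t off, const void *src, size_t n)
{
    assert(s->alias != NULL);
    if (off > s->len || n > s->len - off)
        return -1;
    memcpy(s->alias + off, src, n);
    return 0;
}

/* Remove the writable view before dropping the lock. */
void poke_end(struct poke_session *s)
{
    munmap(s->alias, s->len);
    s->alias = NULL;
    pthread_mutex_unlock(&text_lock);
}
```

A caller would do poke_begin(), one or more poke() calls, then poke_end().
No writable view of the page outlives the critical section, and no
dedicated core is involved.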