Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

Nadav Amit <nadav.amit@xxxxxxxxx> · Tue, 4 Dec 2018 18:09:04 -0800

> On Dec 4, 2018, at 5:45 PM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote:
> 
> On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx>
>>> wrote:
>>> 
>>> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
>>>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@xxxxxxxxx>
>>>>> wrote:
>>>>> 
>>>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
>>>>>>> rick.p.edgecombe@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>>>>>> rick.p.edgecombe@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>>>>>> underlying
>>>>>>>>>> pages,
>>>>>>>>>> it often leaves stale TLB entries to freed pages that could
>>>>>>>>>> get
>>>>>>>>>> re-
>>>>>>>>>> used.
>>>>>>>>>> This is
>>>>>>>>>> undesirable for cases where the memory being freed has special
>>>>>>>>>> permissions
>>>>>>>>>> such
>>>>>>>>>> as executable.
>>>>>>>>> 
>>>>>>>>> So I am trying to finish my patch-set for preventing transient
>>>>>>>>> W+X
>>>>>>>>> mappings
>>>>>>>>> from taking space, by handling kprobes & ftrace that I missed
>>>>>>>>> (thanks
>>>>>>>>> again
>>>>>>>>> for
>>>>>>>>> pointing it out).
>>>>>>>>> 
>>>>>>>>> But all of the sudden, I don’t understand why we have the
>>>>>>>>> problem
>>>>>>>>> that
>>>>>>>>> this
>>>>>>>>> (your) patch-set deals with at all. We already change the
>>>>>>>>> mappings
>>>>>>>>> to
>>>>>>>>> make
>>>>>>>>> the memory writable before freeing the memory, so why can’t we
>>>>>>>>> make
>>>>>>>>> it
>>>>>>>>> non-executable at the same time? Actually, why do we make the
>>>>>>>>> module
>>>>>>>>> memory,
>>>>>>>>> including its data executable before freeing it???
>>>>>>>> 
>>>>>>>> Yeah, this is really confusing, but I have a suspicion it's a
>>>>>>>> combination
>>>>>>>> of the various different configurations and hysterical raisins. We
>>>>>>>> can't
>>>>>>>> rely on module_alloc() allocating from the vmalloc area (see
>>>>>>>> nios2)
>>>>>>>> nor
>>>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>>> 
>>>>>>>> If we *could* rely on module allocations always using vmalloc(),
>>>>>>>> then
>>>>>>>> we could pass in Rick's new flag and drop disable_ro_nx()
>>>>>>>> altogether
>>>>>>>> afaict -- who cares about the memory attributes of a mapping
>>>>>>>> that's
>>>>>>>> about
>>>>>>>> to disappear anyway?
>>>>>>>> 
>>>>>>>> Is it just nios2 that does something different?
>>>>>>>> 
>>>>>>>> Will
>>>>>>> 
>>>>>>> Yea it is really intertwined. I think for x86, set_memory_nx
>>>>>>> everywhere
>>>>>>> would
>>>>>>> solve it as well, in fact that was what I first thought the solution
>>>>>>> should
>>>>>>> be
>>>>>>> until this was suggested. It's interesting that from the other
>>>>>>> thread
>>>>>>> Masami
>>>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and
>>>>>>> would
>>>>>>> have
>>>>>>> inadvertently blocked this on x86. But, on the other architectures I
>>>>>>> have
>>>>>>> since
>>>>>>> learned it is a bit different.
>>>>>>> 
>>>>>>> It looks like actually most arch's don't re-define set_memory_*, and
>>>>>>> so
>>>>>>> all
>>>>>>> of
>>>>>>> the frob_* functions are actually just noops. In which case
>>>>>>> allocating
>>>>>>> RWX
>>>>>>> is
>>>>>>> needed to make it work at all, because that is what the allocation
>>>>>>> is
>>>>>>> going
>>>>>>> to
>>>>>>> stay at. So in these archs, set_memory_nx won't solve it because it
>>>>>>> will
>>>>>>> do
>>>>>>> nothing.
>>>>>>> 
>>>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because
>>>>>>> there
>>>>>>> is
>>>>>>> the
>>>>>>> changing of the permissions on the directmap as well. You don't want
>>>>>>> some
>>>>>>> other
>>>>>>> caller getting a page that was left RO when freed and then trying to
>>>>>>> write
>>>>>>> to
>>>>>>> it, if I understand this.
>>>>>>> 
>>>>>>> The other reasoning was that calling set_memory_nx isn't doing what
>>>>>>> we
>>>>>>> are
>>>>>>> actually trying to do which is prevent the pages from getting
>>>>>>> released
>>>>>>> too
>>>>>>> early.
>>>>>>> 
>>>>>>> A more clear solution for all of this might involve refactoring some
>>>>>>> of
>>>>>>> the
>>>>>>> set_memory_ de-allocation logic out into __weak functions in either
>>>>>>> modules
>>>>>>> or
>>>>>>> vmalloc. As Jessica points out in the other thread though, modules
>>>>>>> does
>>>>>>> a
>>>>>>> lot
>>>>>>> more stuff there than the other module_alloc callers. I think it may
>>>>>>> take
>>>>>>> some
>>>>>>> thought to centralize AND make it optimal for every
>>>>>>> module_alloc/vmalloc_exec
>>>>>>> user and arch.
>>>>>>> 
>>>>>>> But for now with the change in vmalloc, we can block the executable
>>>>>>> mapping
>>>>>>> freed page re-use issue in a cross platform way.
>>>>>> 
>>>>>> Please understand me correctly - I didn’t mean that your patches are
>>>>>> not
>>>>>> needed.
>>>>> 
>>>>> Ok, I think I understand. I have been pondering these same things after
>>>>> Masami
>>>>> Hiramatsu's comments on this thread the other day.
>>>>> 
>>>>>> All I did is asking - how come the PTEs are executable when they are
>>>>>> cleared
>>>>>> they are executable, when in fact we manipulate them when the module
>>>>>> is
>>>>>> removed.
>>>>> 
>>>>> I think the directmap used to be RWX so maybe historically its trying to
>>>>> return
>>>>> it to its default state? Not sure.
>>>>> 
>>>>>> I think I try to deal with a similar problem to the one you encounter
>>>>>> -
>>>>>> broken W^X. The only thing that bothered me in regard to your patches
>>>>>> (and
>>>>>> only after I played with the code) is that there is still a time-
>>>>>> window in
>>>>>> which W^X is broken due to disable_ro_nx().
>>>>> 
>>>>> Totally agree there is overlap in the fixes and we should sync.
>>>>> 
>>>>> What do you think about Andy's suggestion for doing the vfree cleanup in
>>>>> vmalloc
>>>>> with arch hooks? So the allocation goes into vfree fully setup and
>>>>> vmalloc
>>>>> frees
>>>>> it and on x86 resets the direct map.
>>>> 
>>>> As long as you do it, I have no problem ;-)
>>>> 
>>>> You would need to consider all the callers of module_memfree(), and
>>>> probably
>>>> to untangle at least part of the mess in pageattr.c . If you are up to it,
>>>> just say so, and I’ll drop this patch. All I can say is “good luck with
>>>> all
>>>> that”.
>>> 
>>> I thought you were trying to prevent having any memory that at any time was
>>> W+X,
>>> how does vfree help with the module load time issues, where it starts WRX on
>>> x86?
>> 
>> I didn’t say it does. The patch I submitted before [1] should deal with the
>> issue of module loading, and I still think it is required. I also addressed
>> the kprobe and ftrace issues that you raised.
>> 
>> Perhaps it makes more sense that I will include the patch I proposed for
>> module cleanup to make the patch-set “complete”. If you finish the changes
>> you propose before the patch is applied, it could be dropped. I just want to
>> get rid of this series, as it keeps collecting more and more patches.
>> 
>> I suspect it will not be the last version anyhow.
>> 
>> [1] https://lkml.org/lkml/2018/11/21/305
> 
> That seems fine.
> 
> And not to make it any more complicated, but how much different is a W+X mapping
> from a RW mapping that is about to turn X? Can't an attacker with the ability to
> write to the module space just write and wait a short time? If that is the
> threat model, I think there may still be additional work to do here even after
> you found all the W+X cases.

I agree that a complete solution may require to block any direct write onto
a code-page. When I say “complete”, I mean for a threat model in which
dangling pointers are used to inject code, but not to run existing ROP/JOP
gadgets. (I didn’t think too deeply on the threat-model, so perhaps it needs
to be further refined).

I think the first stage is to make everybody go through a unified interface
(text_poke() and text_poke_early()). ftrace, for example, uses an
independent mechanism to change the code.

Eventually, after boot text_poke_early() should not be used, and text_poke()
(or something similar) should be used instead. Alternatively, when module
text is loaded, a hash value can be computed and calculated over it.

Since Igor Stoppa wants to use the infrastructure that is included in the
first patches, and since I didn’t intend this patch-set to be a full
solution for W^X (I was pushed there by tglx+Andy [1]), it may be enough
as a first step.

[1] https://lore.kernel.org/patchwork/patch/1006293/#1191341