> On Dec 4, 2018, at 5:45 PM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote: > > On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote: >>> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> >>> wrote: >>> >>> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote: >>>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P < >>>>> rick.p.edgecombe@xxxxxxxxx> >>>>> wrote: >>>>> >>>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote: >>>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P < >>>>>>> rick.p.edgecombe@xxxxxxxxx> >>>>>>> wrote: >>>>>>> >>>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote: >>>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote: >>>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe < >>>>>>>>>> rick.p.edgecombe@xxxxxxxxx> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the >>>>>>>>>> underlying >>>>>>>>>> pages, >>>>>>>>>> it often leaves stale TLB entries to freed pages that could >>>>>>>>>> get >>>>>>>>>> re- >>>>>>>>>> used. >>>>>>>>>> This is >>>>>>>>>> undesirable for cases where the memory being freed has special >>>>>>>>>> permissions >>>>>>>>>> such >>>>>>>>>> as executable. >>>>>>>>> >>>>>>>>> So I am trying to finish my patch-set for preventing transient >>>>>>>>> W+X >>>>>>>>> mappings >>>>>>>>> from taking space, by handling kprobes & ftrace that I missed >>>>>>>>> (thanks >>>>>>>>> again >>>>>>>>> for >>>>>>>>> pointing it out). >>>>>>>>> >>>>>>>>> But all of the sudden, I don’t understand why we have the >>>>>>>>> problem >>>>>>>>> that >>>>>>>>> this >>>>>>>>> (your) patch-set deals with at all. We already change the >>>>>>>>> mappings >>>>>>>>> to >>>>>>>>> make >>>>>>>>> the memory writable before freeing the memory, so why can’t we >>>>>>>>> make >>>>>>>>> it >>>>>>>>> non-executable at the same time? Actually, why do we make the >>>>>>>>> module >>>>>>>>> memory, >>>>>>>>> including its data executable before freeing it??? >>>>>>>> >>>>>>>> Yeah, this is really confusing, but I have a suspicion it's a >>>>>>>> combination >>>>>>>> of the various different configurations and hysterical raisins. We >>>>>>>> can't >>>>>>>> rely on module_alloc() allocating from the vmalloc area (see >>>>>>>> nios2) >>>>>>>> nor >>>>>>>> can we rely on disable_ro_nx() being available at build time. >>>>>>>> >>>>>>>> If we *could* rely on module allocations always using vmalloc(), >>>>>>>> then >>>>>>>> we could pass in Rick's new flag and drop disable_ro_nx() >>>>>>>> altogether >>>>>>>> afaict -- who cares about the memory attributes of a mapping >>>>>>>> that's >>>>>>>> about >>>>>>>> to disappear anyway? >>>>>>>> >>>>>>>> Is it just nios2 that does something different? >>>>>>>> >>>>>>>> Will >>>>>>> >>>>>>> Yea it is really intertwined. I think for x86, set_memory_nx >>>>>>> everywhere >>>>>>> would >>>>>>> solve it as well, in fact that was what I first thought the solution >>>>>>> should >>>>>>> be >>>>>>> until this was suggested. It's interesting that from the other >>>>>>> thread >>>>>>> Masami >>>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and >>>>>>> would >>>>>>> have >>>>>>> inadvertently blocked this on x86. But, on the other architectures I >>>>>>> have >>>>>>> since >>>>>>> learned it is a bit different. >>>>>>> >>>>>>> It looks like actually most arch's don't re-define set_memory_*, and >>>>>>> so >>>>>>> all >>>>>>> of >>>>>>> the frob_* functions are actually just noops. In which case >>>>>>> allocating >>>>>>> RWX >>>>>>> is >>>>>>> needed to make it work at all, because that is what the allocation >>>>>>> is >>>>>>> going >>>>>>> to >>>>>>> stay at. So in these archs, set_memory_nx won't solve it because it >>>>>>> will >>>>>>> do >>>>>>> nothing. >>>>>>> >>>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because >>>>>>> there >>>>>>> is >>>>>>> the >>>>>>> changing of the permissions on the directmap as well. You don't want >>>>>>> some >>>>>>> other >>>>>>> caller getting a page that was left RO when freed and then trying to >>>>>>> write >>>>>>> to >>>>>>> it, if I understand this. >>>>>>> >>>>>>> The other reasoning was that calling set_memory_nx isn't doing what >>>>>>> we >>>>>>> are >>>>>>> actually trying to do which is prevent the pages from getting >>>>>>> released >>>>>>> too >>>>>>> early. >>>>>>> >>>>>>> A more clear solution for all of this might involve refactoring some >>>>>>> of >>>>>>> the >>>>>>> set_memory_ de-allocation logic out into __weak functions in either >>>>>>> modules >>>>>>> or >>>>>>> vmalloc. As Jessica points out in the other thread though, modules >>>>>>> does >>>>>>> a >>>>>>> lot >>>>>>> more stuff there than the other module_alloc callers. I think it may >>>>>>> take >>>>>>> some >>>>>>> thought to centralize AND make it optimal for every >>>>>>> module_alloc/vmalloc_exec >>>>>>> user and arch. >>>>>>> >>>>>>> But for now with the change in vmalloc, we can block the executable >>>>>>> mapping >>>>>>> freed page re-use issue in a cross platform way. >>>>>> >>>>>> Please understand me correctly - I didn’t mean that your patches are >>>>>> not >>>>>> needed. >>>>> >>>>> Ok, I think I understand. I have been pondering these same things after >>>>> Masami >>>>> Hiramatsu's comments on this thread the other day. >>>>> >>>>>> All I did is asking - how come the PTEs are executable when they are >>>>>> cleared >>>>>> they are executable, when in fact we manipulate them when the module >>>>>> is >>>>>> removed. >>>>> >>>>> I think the directmap used to be RWX so maybe historically its trying to >>>>> return >>>>> it to its default state? Not sure. >>>>> >>>>>> I think I try to deal with a similar problem to the one you encounter >>>>>> - >>>>>> broken W^X. The only thing that bothered me in regard to your patches >>>>>> (and >>>>>> only after I played with the code) is that there is still a time- >>>>>> window in >>>>>> which W^X is broken due to disable_ro_nx(). >>>>> >>>>> Totally agree there is overlap in the fixes and we should sync. >>>>> >>>>> What do you think about Andy's suggestion for doing the vfree cleanup in >>>>> vmalloc >>>>> with arch hooks? So the allocation goes into vfree fully setup and >>>>> vmalloc >>>>> frees >>>>> it and on x86 resets the direct map. >>>> >>>> As long as you do it, I have no problem ;-) >>>> >>>> You would need to consider all the callers of module_memfree(), and >>>> probably >>>> to untangle at least part of the mess in pageattr.c . If you are up to it, >>>> just say so, and I’ll drop this patch. All I can say is “good luck with >>>> all >>>> that”. >>> >>> I thought you were trying to prevent having any memory that at any time was >>> W+X, >>> how does vfree help with the module load time issues, where it starts WRX on >>> x86? >> >> I didn’t say it does. The patch I submitted before [1] should deal with the >> issue of module loading, and I still think it is required. I also addressed >> the kprobe and ftrace issues that you raised. >> >> Perhaps it makes more sense that I will include the patch I proposed for >> module cleanup to make the patch-set “complete”. If you finish the changes >> you propose before the patch is applied, it could be dropped. I just want to >> get rid of this series, as it keeps collecting more and more patches. >> >> I suspect it will not be the last version anyhow. >> >> [1] https://lkml.org/lkml/2018/11/21/305 > > That seems fine. > > And not to make it any more complicated, but how much different is a W+X mapping > from a RW mapping that is about to turn X? Can't an attacker with the ability to > write to the module space just write and wait a short time? If that is the > threat model, I think there may still be additional work to do here even after > you found all the W+X cases. I agree that a complete solution may require to block any direct write onto a code-page. When I say “complete”, I mean for a threat model in which dangling pointers are used to inject code, but not to run existing ROP/JOP gadgets. (I didn’t think too deeply on the threat-model, so perhaps it needs to be further refined). I think the first stage is to make everybody go through a unified interface (text_poke() and text_poke_early()). ftrace, for example, uses an independent mechanism to change the code. Eventually, after boot text_poke_early() should not be used, and text_poke() (or something similar) should be used instead. Alternatively, when module text is loaded, a hash value can be computed and calculated over it. Since Igor Stoppa wants to use the infrastructure that is included in the first patches, and since I didn’t intend this patch-set to be a full solution for W^X (I was pushed there by tglx+Andy [1]), it may be enough as a first step. [1] https://lore.kernel.org/patchwork/patch/1006293/#1191341