Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> · Tue, 18 Dec 2018 01:27:22 +0000

On Mon, 2018-12-17 at 05:41 +0100, Jessica Yu wrote:
> +++ Edgecombe, Rick P [12/12/18 23:05 +0000]:
> > On Wed, 2018-11-28 at 01:40 +0000, Edgecombe, Rick P wrote:
> > > On Tue, 2018-11-27 at 11:21 +0100, Daniel Borkmann wrote:
> > > > On 11/27/2018 01:19 AM, Edgecombe, Rick P wrote:
> > > > > On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote:
> > > > > > +++ Rick Edgecombe [20/11/18 15:23 -0800]:
> > > > > 
> > > > > [snip]
> > > > > > Hi Rick!
> > > > > > 
> > > > > > Sorry for the delay. I'd like to take a step back and ask some
> > > > > > broader
> > > > > > questions -
> > > > > > 
> > > > > > - Is the end goal of this patchset to randomize loading kernel
> > > > > > modules,
> > > > > > or
> > > > > > most/all
> > > > > >    executable kernel memory allocations, including bpf, kprobes,
> > > > > > etc?
> > > > > 
> > > > > Thanks for taking a look!
> > > > > 
> > > > > It started with the goal of just randomizing modules (hence the name),
> > > > > but
> > > > > I
> > > > > think there is maybe value in randomizing the placement of all runtime
> > > > > added
> > > > > executable code. Beyond just trying to make executable code placement
> > > > > less
> > > > > deterministic in general, today all of the usages have the property of
> > > > > starting
> > > > > with RW permissions and then becoming RO executable, so there is the
> > > > > benefit
> > > > > of
> > > > > narrowing the chances a bug could successfully write to it during the
> > > > > RW
> > > > > window.
> > > > > 
> > > > > > - It seems that a lot of complexity and heuristics are introduced
> > > > > > just
> > > > > > to
> > > > > >    accommodate the potential fragmentation that can happen when the
> > > > > > module
> > > > > > vmalloc
> > > > > >    space starts to get fragmented with bpf filters. I'm partial to
> > > > > > the
> > > > > > idea of
> > > > > >    splitting or having bpf own its own vmalloc space, similar to
> > > > > > what
> > > > > > Ard
> > > > > > is
> > > > > > already
> > > > > >    implementing for arm64.
> > > > > > 
> > > > > >    So a question for the bpf and x86 folks, is having a dedicated
> > > > > > vmalloc
> > > > > > region
> > > > > >    (as well as a seperate bpf_alloc api) for bpf feasible or
> > > > > > desirable
> > > > > > on
> > > > > > x86_64?
> > > > > 
> > > > > I actually did some prototyping and testing on this. It seems there
> > > > > would
> > > > > be
> > > > > some slowdown from the required changes to the JITed code to support
> > > > > calling
> > > > > back from the vmalloc region into the kernel, and so module space
> > > > > would
> > > > > still be
> > > > > the preferred region.
> > > > 
> > > > Yes, any runtime slow-down would be no-go as BPF sits in the middle of
> > > > critical
> > > > networking fast-path and e.g. on XDP or tc layer and is used in load-
> > > > balancing,
> > > > firewalling, DDoS protection scenarios, some recent examples in [0-3].
> > > > 
> > > >   [0] http://vger.kernel.org/lpc-networking2018.html#session-10
> > > >   [1] http://vger.kernel.org/lpc-networking2018.html#session-15
> > > >   [2] https://blog.cloudflare.com/how-to-drop-10-million-packets/
> > > >   [3] http://vger.kernel.org/lpc-bpf2018.html#session-1
> > > > 
> > > > > >    If bpf filters need to be within 2 GB of the core kernel, would
> > > > > > it
> > > > > > make
> > > > > > sense
> > > > > >    to carve out a portion of the current module region for bpf
> > > > > > filters?  According
> > > > > >    to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB.
> > > > > > I
> > > > > > am
> > > > > > doubtful
> > > > > >    that any real system will actually have 1.5 GB worth of kernel
> > > > > > modules
> > > > > > loaded.
> > > > > >    Is there a specific reason why that much space is dedicated to
> > > > > > kernel
> > > > > > modules,
> > > > > >    and would it be feasible to split that region cleanly with bpf?
> > > > > 
> > > > > Hopefully someone from BPF side of things will chime in, but my
> > > > > understanding
> > > > > was that they would like even more space than today if possible and so
> > > > > they
> > > > > may
> > > > > not like the reduced space.
> > > > 
> > > > I wouldn't mind of the region is split as Jessica suggests but in a way
> > > > where
> > > > there would be _no_ runtime regressions for BPF. This might also allow
> > > > to
> > > > have
> > > > more flexibility in sizing the area dedicated for BPF in future, and
> > > > could
> > > > potentially be done in similar way as Ard was proposing recently [4].
> > > > 
> > > >   [4] https://patchwork.ozlabs.org/project/netdev/list/?series=77779
> > > 
> > > CCing Ard.
> > > 
> > > The benefit of sharing the space, for randomization at least, is that you
> > > can
> > > spread the allocations over a larger area.
> > > 
> > > I think there are also other benefits to unifying how this memory is
> > > managed
> > > though, rather than spreading it further. Today there are various patterns
> > > and
> > > techniques used like calling different combinations of set_memory_* before
> > > freeing, zeroing in modules or setting invalid instructions like BPF does,
> > > etc.
> > > There is also special care to be taken on vfree-ing executable memory. So
> > > this
> > > way things only have to be done right once and there is less duplication.
> > > 
> > > Not saying there shouldn't be __weak alloc and free method in BPF for arch
> > > specific behavior, just that there is quite a few other concerns that
> > > could be
> > > good to centralize even more than today.
> > > 
> > > What if there was a unified executable alloc API with support for things
> > > like:
> > >  - Concepts of two regions for Ard's usage, near(modules) and far(vmalloc)
> > > from
> > >    kernel text. Won't apply for every arch, but maybe enough that some
> > > logic
> > >    could be unified
> > >  - Limits for each of the usages (modules, bpf, kprobes, ftrace)
> > >  - Centralized logic for moving between RW and RO+X
> > >  - Options for exclusive regions or all shared
> > >  - Randomizing base, randomizing independently or none
> > >  - Some cgroups hooks?
> > > 
> > > Would there be any interest in that for the future?
> > > 
> > > As a next step, if BPF doesn't want to use this by default, could BPF just
> > > call
> > > vmalloc_node_range directly from Ard's new __weak functions on x86? Then
> > > modules
> > > can randomize across the whole space and BPF can fill the gaps linearly
> > > from
> > > the
> > > beginning. Is that acceptable? Then the vmalloc optimizations could be
> > > dropped
> > > for the time being since the BPFs would not be fragmented, but the
> > > separate
> > > regions could come as part of future work.
> > 
> > Jessica, Daniel,
> > 
> > Any advice for me on how we could move this forward?
> 
> Hi Rick,
> 
> It would be good for the x86 folks to chime in if they find the
> x86-related module changes agreeable (in particular, the partitioning
> and sizing of the module space in separate randomization and backup
> areas). Has that happened already or did I just miss that in the
> previous versions?
Andrew Morton(on v8) and Kees Cook(way back on v1 IIRC) had asked if we need the
backup area at all. The answer is yes in the case of heavy usage from the other
module_alloc users, or late added large modules have a real world chance of
being blocked.

The sizes of the areas were chosen experimentally with the simulations, but I
didn't save the data.

Anyone in particular you would want to see comment on this?

> I'm impartial towards the vmalloc optimizations, as I wouldn't
> consider module loading performance-critical (For instance, you'd most
> likely just load a driver once and be done with it, and it's not like
> you'd very frequently be loading/unloading modules. And note I mean
> loading a kernel module, not module_alloc() allocations. These two
> concepts are starting to get conflated :-/ ). So, I'd leave the
> optimizations up to the BPF folks if they consider that beneficial for
> their module_alloc() allocations.
Daniel, Alexei,

Any thoughts how you would prefer this works with BPF JIT?

> And it looks like there isn't really a strong push or interest on
> having a separate vmalloc area for bpf, so I suppose we can drop that
> idea for now (that would be a separate patchset on its own anyway).
> I just suggested the idea because I was curious if that would have
> helped with the potential fragmentation issues. In any case it sounded
> like the potentially reduced space (should the module space be split
> between bpf and modules) isn't desirable.
[snip]