On Wed, 2018-11-28 at 01:40 +0000, Edgecombe, Rick P wrote: > On Tue, 2018-11-27 at 11:21 +0100, Daniel Borkmann wrote: > > On 11/27/2018 01:19 AM, Edgecombe, Rick P wrote: > > > On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote: > > > > +++ Rick Edgecombe [20/11/18 15:23 -0800]: > > > > > > [snip] > > > > Hi Rick! > > > > > > > > Sorry for the delay. I'd like to take a step back and ask some broader > > > > questions - > > > > > > > > - Is the end goal of this patchset to randomize loading kernel modules, > > > > or > > > > most/all > > > > executable kernel memory allocations, including bpf, kprobes, etc? > > > > > > Thanks for taking a look! > > > > > > It started with the goal of just randomizing modules (hence the name), but > > > I > > > think there is maybe value in randomizing the placement of all runtime > > > added > > > executable code. Beyond just trying to make executable code placement less > > > deterministic in general, today all of the usages have the property of > > > starting > > > with RW permissions and then becoming RO executable, so there is the > > > benefit > > > of > > > narrowing the chances a bug could successfully write to it during the RW > > > window. > > > > > > > - It seems that a lot of complexity and heuristics are introduced just > > > > to > > > > accommodate the potential fragmentation that can happen when the > > > > module > > > > vmalloc > > > > space starts to get fragmented with bpf filters. I'm partial to the > > > > idea of > > > > splitting or having bpf own its own vmalloc space, similar to what > > > > Ard > > > > is > > > > already > > > > implementing for arm64. > > > > > > > > So a question for the bpf and x86 folks, is having a dedicated > > > > vmalloc > > > > region > > > > (as well as a seperate bpf_alloc api) for bpf feasible or desirable > > > > on > > > > x86_64? > > > > > > I actually did some prototyping and testing on this. It seems there would > > > be > > > some slowdown from the required changes to the JITed code to support > > > calling > > > back from the vmalloc region into the kernel, and so module space would > > > still be > > > the preferred region. > > > > Yes, any runtime slow-down would be no-go as BPF sits in the middle of > > critical > > networking fast-path and e.g. on XDP or tc layer and is used in load- > > balancing, > > firewalling, DDoS protection scenarios, some recent examples in [0-3]. > > > > [0] http://vger.kernel.org/lpc-networking2018.html#session-10 > > [1] http://vger.kernel.org/lpc-networking2018.html#session-15 > > [2] https://blog.cloudflare.com/how-to-drop-10-million-packets/ > > [3] http://vger.kernel.org/lpc-bpf2018.html#session-1 > > > > > > If bpf filters need to be within 2 GB of the core kernel, would it > > > > make > > > > sense > > > > to carve out a portion of the current module region for bpf > > > > filters? According > > > > to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB. I > > > > am > > > > doubtful > > > > that any real system will actually have 1.5 GB worth of kernel > > > > modules > > > > loaded. > > > > Is there a specific reason why that much space is dedicated to kernel > > > > modules, > > > > and would it be feasible to split that region cleanly with bpf? > > > > > > Hopefully someone from BPF side of things will chime in, but my > > > understanding > > > was that they would like even more space than today if possible and so > > > they > > > may > > > not like the reduced space. > > > > I wouldn't mind of the region is split as Jessica suggests but in a way > > where > > there would be _no_ runtime regressions for BPF. This might also allow to > > have > > more flexibility in sizing the area dedicated for BPF in future, and could > > potentially be done in similar way as Ard was proposing recently [4]. > > > > [4] https://patchwork.ozlabs.org/project/netdev/list/?series=77779 > > CCing Ard. > > The benefit of sharing the space, for randomization at least, is that you can > spread the allocations over a larger area. > > I think there are also other benefits to unifying how this memory is managed > though, rather than spreading it further. Today there are various patterns and > techniques used like calling different combinations of set_memory_* before > freeing, zeroing in modules or setting invalid instructions like BPF does, > etc. > There is also special care to be taken on vfree-ing executable memory. So this > way things only have to be done right once and there is less duplication. > > Not saying there shouldn't be __weak alloc and free method in BPF for arch > specific behavior, just that there is quite a few other concerns that could be > good to centralize even more than today. > > What if there was a unified executable alloc API with support for things like: > - Concepts of two regions for Ard's usage, near(modules) and far(vmalloc) > from > kernel text. Won't apply for every arch, but maybe enough that some logic > could be unified > - Limits for each of the usages (modules, bpf, kprobes, ftrace) > - Centralized logic for moving between RW and RO+X > - Options for exclusive regions or all shared > - Randomizing base, randomizing independently or none > - Some cgroups hooks? > > Would there be any interest in that for the future? > > As a next step, if BPF doesn't want to use this by default, could BPF just > call > vmalloc_node_range directly from Ard's new __weak functions on x86? Then > modules > can randomize across the whole space and BPF can fill the gaps linearly from > the > beginning. Is that acceptable? Then the vmalloc optimizations could be dropped > for the time being since the BPFs would not be fragmented, but the separate > regions could come as part of future work. Jessica, Daniel, Any advice for me on how we could move this forward? Thanks, Rick > Thanks, > > Rick > > > > Also with KASLR on x86 its actually only 1GB, so it would only be 500MB > > > per > > > section (assuming kprobes, etc would share the non-module region, so just > > > two > > > sections). > > > > > > > - If bpf gets its own dedicated vmalloc space, and we stick to the > > > > single > > > > task > > > > of randomizing *just* kernel modules, could the vmalloc optimizations > > > > and > > > > the > > > > "backup" area be dropped? The benefits of the vmalloc optimizations > > > > seem to > > > > only be noticeable when we get to thousands of module_alloc > > > > allocations > > > > - > > > > again, a concern caused by bpf filters sharing the same space with > > > > kernel > > > > modules. > > > > > > I think the backup area may still be needed, for example if you have 200 > > > modules > > > evenly spaced inside 500MB there is only average ~2.5MB gap between them. > > > So > > > a > > > late added large module could still get blocked. > > > > > > > So tldr, it seems to me that the concern of fragmentation, the > > > > vmalloc > > > > optimizations, and the main purpose of the backup area - basically, > > > > the > > > > more > > > > complex parts of this patchset - stems squarely from the fact that > > > > bpf > > > > filters > > > > share the same space as modules on x86. If we were to focus on > > > > randomizing > > > > *just* kernel modules, and if bpf and modules had their own dedicated > > > > regions, > > > > then I *think* the concrete use cases for the backup area and the > > > > vmalloc > > > > optimizations (if we're strictly considering just kernel modules) > > > > would > > > > mostly disappear (please correct me if I'm in the wrong here). Then > > > > tackling the > > > > randomization of bpf allocations could potentially be a separate task > > > > on > > > > its own. > > > > > > Yes it seems then the vmalloc optimizations could be dropped then, but I > > > don't > > > think the backup area could be. Also the entropy would go down since there > > > would > > > be less possible positions and we would reduce the space available to BPF. > > > So > > > there are some downsides just to remove the vmalloc piece. > > > > > > Is your concern that vmalloc optimizations might regress something else? > > > There > > > is a middle ground vmalloc optimization where only the try_purge flag is > > > plumbed > > > through. The flag was most of the performance gained and with just that > > > piece it > > > should not change any behavior for the non-modules flows. Would that be > > > more > > > acceptable? > > > > > > > Thanks! > > > > > > > > Jessica > > > > > > > > > > [snip] > > > > > > >