On Wed, May 17, 2023 at 05:04:27PM +0300, Mike Rapoport wrote: > On Wed, May 17, 2023 at 01:28:43AM -0400, Kent Overstreet wrote: > > On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote: > > > On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote: > > > > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote: > > > > > For something that small, why not use the text_poke API? > > > > > > > > This looks like it's meant for patching existing kernel text, which > > > > isn't what I want - I'm generating new functions on the fly, one per > > > > btree node. > > > > > > > > I'm working up a new allocator - a (very simple) slab allocator where > > > > you pass a buffer, and it gives you a copy of that buffer mapped > > > > executable, but not writeable. > > > > > > > > It looks like we'll be able to convert bpf, kprobes, and ftrace > > > > trampolines to it; it'll consolidate a fair amount of code (particularly > > > > in bpf), and they won't have to burn a full page per allocation anymore. > > > > > > > > bpf has a neat trick where it maps the same page in two different > > > > locations, one is the executable location and the other is the writeable > > > > location - I'm stealing that. > > > > > > How does that avoid the problem of being able to construct an arbitrary > > > gadget that somebody else will then execute? IOW, what bpf has done > > > seems like it's working around & undoing the security improvements. > > > > > > I suppose it's an improvement that only the executable address is > > > passed back to the caller, and not the writable address. > > > > Ok, here's what I came up with. Have not tested all corner cases, still > > need to write docs - but I think this gives us a nicer interface than > > what bpf/kprobes/etc. have been doing, and it does the sub-page sized > > allocations I need. > > > > With an additional tweak to module_alloc() (not done in this patch yet) > > we avoid ever mapping in pages both writeable and executable: > > > > -->-- > > > > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001 > > From: Kent Overstreet <kent.overstreet@xxxxxxxxx> > > Date: Wed, 17 May 2023 01:22:06 -0400 > > Subject: [PATCH] mm: jit/text allocator > > > > This provides a new, very simple slab allocator for jit/text, i.e. bpf, > > ftrace trampolines, or bcachefs unpack functions. > > > > With this API we can avoid ever mapping pages both writeable and > > executable (not implemented in this patch: need to tweak > > module_alloc()), and it also supports sub-page sized allocations. > > This looks like yet another workaround for that module_alloc() was not > designed to handle permission changes. Rather than create more and more > wrappers for module_alloc() we need to have core API for code allocation, > apparently on top of vmalloc, and then use that API for modules, bpf, > tracing and whatnot. > > There was quite lengthy discussion about how to handle code allocations > here: > > https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@xxxxxxxxxx/ Thanks for the link! Added Song to the CC. Song, I'm looking at your code now - switching to hugepages is great, but I wonder if we might be able to combine our two approaches - with the slab allocator I did, do we have to bother with VMAs at all? And then it gets us sub-page sized allocations. > and Song is already working on improvements for module_alloc(), e.g. see > commit ac3b43283923 ("module: replace module_layout with module_memory") > > Another thing, the code below will not even compile on !x86. Due to text_poke(), which I see is abstracted better in that patchset. I'm very curious why text_poke() does tlb flushing at all; it seems like flush_icache_range() is actually what's needed? text_poke() also only touching up to two pages, without that being documented, is also a footgun... And I'm really curious why text_poke() is needed at all. Seems like we could just use kmap_local() to create a temporary writeable mapping, except in my testing that got me a RO mapping. Odd.