On Wed, 14 Feb 2024 15:01:25 +0100
Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:

> On Wed, Feb 14, 2024 at 05:30:53AM -0800, Andrew Morton wrote:
> > On Wed, 14 Feb 2024 12:30:35 +0100 Petr Tesarik <petrtesarik@xxxxxxxxxxxxxxx> wrote:
> >
> > > +Although data structures are not serialized and deserialized between kernel
> > > +mode and sandbox mode, all directly and indirectly referenced data structures
> > > +must be explicitly mapped into the sandbox, which requires some manual effort.
> >
> > Maybe I'm missing something here, but...
> >
> > The requirement that the sandboxed function only ever touch two linear
> > blocks of memory (yes?) seems a tremendous limitation.  I mean, how can
> > the sandboxed function call kmalloc()?  How can it call any useful
> > kernel functions?  They'll all touch memory which lies outside the
> > sandbox areas?
> >
> > Perhaps a simple but real-world example would help clarify.
>
> I agree, this looks like an "interesting" framework, but we don't add
> code to the kernel without a real, in-kernel user for it.
>
> Without such a thing, we can't even consider it for inclusion as we
> don't know how it will actually work and how any subsystem would use it.
>
> Petr, do you have an user for this today?

Hi Greg & Andrew,

your observations are correct. In this form, the framework is quite
limited, and exactly these objections were expected. You have even
spotted one of the first enhancements I tested on top of this framework
(dynamic memory allocation).

The intended use case is code that processes untrusted data that is not
properly sanitized, but where performance is not critical. Examples
include decompressing the initramfs, loading a kernel module, or decoding
a boot logo; I think I've noticed a vulnerability in another project
recently... ;-)

Of course, even decompression needs dynamic memory. My plan is to extend
the mechanism. Right now I'm mapping all of kernel text into the sandbox.
Later, I'd like to decompose the text section too. The pages which
contain sandboxed code should be mapped, but the rest of the kernel
should not. If the sandbox tries to call kmalloc(), vmalloc(), or
schedule(), the attempt will generate a page fault. Sandbox page faults
are already intercepted, so handle_sbm_call() can decide whether the call
should be allowed.

If the sandbox policy says ALLOW, the page fault handler will perform the
API call on behalf of the sandboxed code and return the results, possibly
with some post-call action, e.g. mapping some more pages into the address
space.

The fact that all communication with the rest of the kernel happens
through CPU exceptions is the reason this mechanism is unsuitable for
performance-critical applications.

OK, so why didn't I send the whole thing?

Decomposition of the kernel requires many more changes, e.g. in linker
scripts. Some of them depend on this patch series. Before I go and clean
up my code into something that can be submitted, I want to get feedback
from guys like you, to know whether the whole idea would even be
considered, aka "Fail Fast".

Petr T
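
For illustration only, the ALLOW/deny flow described above could be
sketched in plain user-space C roughly as follows. This is not the code
from the patch series and says nothing about how handle_sbm_call() is
really implemented. Every identifier here (sbm_verdict, sbm_policy_entry,
sbm_handle_fault, the demo proxies and the addresses in the table) is a
hypothetical name invented for this sketch; only the word ALLOW comes
from the text. The sketch assumes the policy is a simple table keyed by
the faulting address and that anything not listed is denied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts; only the word ALLOW comes from the mail above. */
enum sbm_verdict { SBM_DENY, SBM_ALLOW };

/* Hypothetical proxy: does the work on behalf of the sandboxed code. */
typedef long (*sbm_proxy_fn)(long arg);

/* One policy entry per kernel entry point the sandbox might jump to. */
struct sbm_policy_entry {
	uintptr_t target;          /* address that faulted (made up here) */
	enum sbm_verdict verdict;
	sbm_proxy_fn proxy;
};

/*
 * Stand-in for an allocation call: returns the number of bytes that
 * would be mapped into the sandbox, rounded up to whole 4 KiB pages.
 */
static long demo_alloc_proxy(long size)
{
	return (size + 4095) / 4096 * 4096;
}

/* Stand-in for a call that the demo policy never permits. */
static long demo_sched_proxy(long arg)
{
	(void)arg;
	return 0;
}

/* Demo policy with made-up addresses. */
static const struct sbm_policy_entry demo_policy[] = {
	{ 0x1000, SBM_ALLOW, demo_alloc_proxy },
	{ 0x2000, SBM_DENY,  demo_sched_proxy },
};
#define DEMO_POLICY_LEN (sizeof(demo_policy) / sizeof(demo_policy[0]))

/*
 * Called when the sandbox faults on fault_addr. Returns 0 and stores the
 * proxy's return value in *result if the policy allows the call, or
 * -EPERM if the call is denied or the address is not in the table.
 */
static int sbm_handle_fault(const struct sbm_policy_entry *tbl, size_t n,
			    uintptr_t fault_addr, long arg, long *result)
{
	assert(result != NULL);

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].target != fault_addr)
			continue;
		if (tbl[i].verdict != SBM_ALLOW || tbl[i].proxy == NULL)
			return -EPERM;
		*result = tbl[i].proxy(arg);
		return 0;
	}
	return -EPERM;	/* default deny */
}
```

A caller would pass the faulting address and the call argument, e.g.
sbm_handle_fault(demo_policy, DEMO_POLICY_LEN, 0x1000, 100, &ret), and
resume the sandbox with ret only when the function returns 0. Default
deny is a design choice of this sketch, not something stated in the mail.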