Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

Benjamin Berg <benjamin@xxxxxxxxxxxxxxxx> · Wed, 14 Apr 2021 11:24:19 +0200

On Wed, 2021-04-14 at 09:34 +0200, Johannes Berg wrote:
> On Wed, 2021-04-14 at 08:22 +0100, Anton Ivanov wrote:
> > On 14/04/2021 06:52, Andrei Vagin wrote:
> > > We already have process_vm_readv and process_vm_writev to read and
> > > write
> > > to a process memory faster than we can do this with ptrace. And now
> > > it
> > > is time for process_vm_exec that allows executing code in an
> > > address
> > > space of another process. We can do this with ptrace but it is much
> > > slower.
> > > 
> > > = Use-cases =
> > > 
> > > Here are two known use-cases. The first one is “application kernel”
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes
> > > that
> > > are used to manage guest address spaces. Guest code is executed in
> > > the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sort of sandboxes
> > > use
> > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > > significantly speed them up.
> > 
> > Certainly interesting, but will require um to rework most of its
> > memory 
> > management and we will most likely need extra mm support to make use
> > of 
> > it in UML. We are not likely to get away just with one syscall there.
> 
> Might help the seccomp mode though:
> 
> https://patchwork.ozlabs.org/project/linux-um/list/?series=231980

Hmm, to me it sounds like it replaces both ptrace and seccomp mode
while completely avoiding the scheduling overhead that these techniques
have. I think everything UML needs is covered:

 * The new API can do syscalls in the target memory space
   (we can modify the address space)
 * The new API can run code until the next syscall happens
   (or a signal happens, which means SIGALRM for scheduling works)
 * Single step tracing should work by setting EFLAGS

I think the memory management itself stays fundamentally the same. We
just do the initial clone() using CLONE_STOPPED. We don't need any stub
code/data and we have everything we need to modify the address space
and run the userspace process.

Benjamin
Attachment:
signature.asc

Description: This is a digitally signed message part