Re: [PATCH v4 2/4] vm: add a syscall to map a process memory into a pipe

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Mon, 27 Nov 2017 15:42:49 -0800

On Mon, 27 Nov 2017 09:19:39 +0200 Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx> wrote:

> From: Andrei Vagin <avagin@xxxxxxxxxxxxx>
> 
> It is a hybrid of process_vm_readv() and vmsplice().
> 
> vmsplice() can map memory from the current address space into a pipe.
> process_vm_readv() can read the memory of another process.
> 
> The new system call can map the memory of another process into a pipe.
> 
> ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
>                         unsigned long nr_segs, unsigned int flags)
> 
> All arguments are identical to those of vmsplice() except pid, which
> specifies the target process.
> 
> Currently, if we want to dump a process's memory to a file or to a
> socket, we can use process_vm_readv() + write(), but this is slow
> because the data is copied into a temporary user-space buffer.
> 
> A second way is to use vmsplice() + splice(). It is more efficient
> because the data is not copied into a temporary buffer, but it has
> another problem: vmsplice() works only with the current address space,
> so it can be used only if we inject our code into the target process.
> 
> The second way suffers from a few other issues:
> * the process has to be stopped to run the parasite code
> * the number of pipes is limited, so it may be impossible to dump all
>   memory in one iteration, and we have to stop the process and inject
>   our code several times.
> * pages in pipes are unreclaimable, so it isn't good to hold a lot of
>   memory in pipes.
> 
> The introduced syscall makes it possible to use the second way without
> injecting any code into the target process.
> 
> My experiments show that process_vmsplice() + splice() is twice as
> fast as process_vm_readv() + write().
>
> It is particularly useful in the pre-dump stage. In this stage we
> enable a memory tracker and then dump the process's memory while the
> process continues to run. On the first iteration we dump all memory,
> and on each later iteration we dump only the memory modified since the
> previous iteration. After a few pre-dump operations, the process is
> stopped and the final dump is taken. The pre-dump operations
> significantly decrease the process's downtime when it is migrated to
> another host.

What is the overall improvement in a typical dumping operation?

Does that improvement justify the addition of a new syscall, and all
that this entails?  If so, why?

Are there any other applications of this syscall?