On 21.05.2019 20:28, Jann Horn wrote:
> On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> wrote:
>> On 21.05.2019 19:20, Jann Horn wrote:
>>> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> wrote:
>>>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> wrote:
>>>>>> This is a new syscall, which allows cloning a remote process's VMA
>>>>>> into the local process's VM. The remote process's page table
>>>>>> entries related to the VMA are cloned into the local process's
>>>>>> page table (at any desired address, which makes this different
>>>>>> from what happens during fork()). Huge pages are handled
>>>>>> appropriately.
>>> [...]
>>>>>> There are several problems with process_vm_writev() in this example:
>>>>>>
>>>>>> 1) It causes a page fault on the remote process's memory, and it
>>>>>>    forces allocation of a new page (if one was not preallocated);
>>>>>
>>>>> I don't see how your new syscall helps. You're writing to remote
>>>>> memory. If that memory wasn't allocated, it's going to get allocated
>>>>> regardless of whether you use a write-like interface or an mmap-like
>>>>> interface.
>>>>
>>>> No, this is not about just another interface for copying memory.
>>>> It is about borrowing a remote task's VMA and the corresponding
>>>> page table content. The syscall allows copying part of a page
>>>> table, with its preallocated pages, from the remote to the local
>>>> process. See here:
>>>>
>>>> [task1]                            [task2]
>>>>
>>>> buf = mmap(NULL, n * PAGE_SIZE,
>>>>            PROT_READ|PROT_WRITE,
>>>>            MAP_PRIVATE|MAP_ANONYMOUS,
>>>>            ...);
>>>>
>>>> <task1 populates buf>
>>>>
>>>>                                    buf = process_vm_mmap(pid_of_task1,
>>>>                                                          addr,
>>>>                                                          n * PAGE_SIZE,
>>>>                                                          ...);
>>>> munmap(buf);
>>>>
>>>> process_vm_mmap() copies the PTEs related to the memory of buf in
>>>> task1 to task2, just like we do during the fork() syscall.
>>>>
>>>> There is no copying of buf's memory content, unless COW happens.
>>>> This is the principal difference to process_vm_writev(), which just
>>>> allocates pages in the remote VM.
>>>>
>>>>> Keep in mind that, on x86, just the hardware part of a
>>>>> page fault is very slow -- populating the memory with a syscall
>>>>> instead of a fault may well be faster.
>>>>
>>>> It is not as slow as disk IO is. Just compare what happens when the
>>>> anonymous pages related to buf of task1 are swapped out:
>>>>
>>>> 1) process_vm_writev() reads them back into memory;
>>>>
>>>> 2) process_vm_mmap() just copies the swap PTEs from task1's page
>>>>    table to task2's page table.
>>>>
>>>> Also, for faster page faults one may use huge pages for the
>>>> mappings. But really, it's funny to think about page faults when
>>>> there are the disk IO problems I have shown.
>>> [...]
>>>>> That only doubles the amount of memory if you let n
>>>>> scale linearly with p, which seems unlikely.
>>>>>
>>>>>> 3) The received data has no chance to be properly swapped for
>>>>>>    a long time.
>>>>>
>>>>> ...
>>>>>
>>>>>> a) The kernel moves @buf pages into swap right after recv();
>>>>>> b) process_vm_writev() reads the data back from swap into pages;
>>>>>
>>>>> If you're under that much memory pressure and thrashing that badly,
>>>>> your performance is going to be awful no matter what you're doing. If
>>>>> you indeed observe this behavior under normal loads, then this seems
>>>>> like a VM issue that should be addressed in its own right.
>>>>
>>>> I don't think so. Imagine: a container migrates from one node to
>>>> another. The nodes are identical; say, each of them has 4GB of RAM.
>>>>
>>>> Before the migration, the container's tasks used 4GB of RAM and 8GB
>>>> of swap. After the page server on the second node has received the
>>>> pages, we want these pages to become swapped as soon as possible, and
>>>> we don't want to read them back from swap just to pass them to a
>>>> consumer.
>>>
>>> But you don't have to copy that memory into the container's tasks all
>>> at once, right?
>>> Can't you, every time you've received a few dozen kilobytes of data
>>> or whatever, shove them into the target task? That way you don't
>>> have problems with swap, because the time before the data has
>>> arrived in its final VMA is tiny.
>>
>> We try to keep the live-migration downtime as small as possible, and
>> the container on the source node is completely stopped only at the
>> very end. The memory of the container's tasks is copied in the
>> background, without stopping the container completely, and
>> _PAGE_SOFT_DIRTY is used to track dirty pages.
>>
>> The container may create new processes during the migration, and
>> these processes may contain any memory mappings.
>>
>> Imagine the situation. We migrate a big web server with a lot of
>> processes, and some of the child processes have the same COW mapping
>> the parent has. If the whole memory dump is available at the moment
>> the grandparent web server process is created, we populate the
>> mapping in the parent, and all the children may inherit the mapping
>> after fork(), if they want to. COW works here. But if some processes
>> are created before all the memory is available on the destination
>> node, we can't do such COW inheritance. That would be the reason the
>> memory consumed by the container grows many times after the
>> migration. So, the only solution is to create the process tree after
>> the memory is available and all mappings are known.
>
> But if one of the processes modifies the memory after you've started
> migrating it to the new machine, that memory can't be CoW anymore
> anyway, right? So it should work if you first do a first pass of
> copying the memory and creating the process hierarchy, and then copy
> more recent changes into the individual processes, breaking the CoW
> for those pages, right?

Not so. We have to have all processes killed on the source node before
creating the same process tree on the destination machine. The process
tree has to be completely analyzed before we try to recreate it from
the ground up.
The analysis allows us to choose the strategy and the sequence of each
process's creation, and the inheritance of entities like namespaces,
mm, fdtables, etc. It's impossible to restore a process tree if you
have already started to create it but haven't stopped the processes on
the source node. Also, we can restore only a subset of process trees,
and you never know what will happen to a live process tree later. So,
the source process tree must be frozen before the restore.

Restoring an arbitrary process tree, within all the Linux limitations
on the sequence of actions needed to recreate each entity of a
process, makes this a nontrivial mathematical problem, which has no
solution at the moment, and it's unknown whether it has one at all.
So, at the moment the restore starts, we must know everything about
all tasks and their relationships.

Yes, we start to copy memory from the source to the destination while
the container on the source is alive. But we track this memory with
the _PAGE_SOFT_DIRTY flag, and dirtied pages are repopulated with the
new content.

Please see the comment about "hellish" below.

>> It's one of the examples. But believe me, there are a lot of other
>> reasons why the process tree should be created only after the whole
>> process tree is frozen and no new tasks on the source are possible.
>> PGID and SSID inheritance, for example. All of this requires a
>> special order of task creation. If you try to restore a process tree
>> with correct namespaces, and especially with many user namespaces in
>> a container, you will see a hell open before your eyes, and we can't
>> even think about that.
>
> Could you elaborate on why that is so hellish?

Because you never know the way the system came into the state you're
seeing at the moment.
Even in a simple process chain like:

        task1
          |
        task2
          |
        task3
         / \
    task4   task5

any of these processes may change its namespace, pgid, or ssid;
unshare its mm or files; become a child subreaper; or die and have its
children reparented -- and it may do any of these before or after its
parent did one of its own actions. All of these actions are not
independent of each other, and some actions prohibit others. And you
can restore the process tree only if you replay the whole sequence in
the same order the container performed it.

It's impossible to say "task2, set your session to 5!", because the
only way to set the ssid is the setsid() syscall, which may be called
only once in a process's life. But setsid() itself imposes limitations
on a later setpgid() call (see the code, if interested). And this
limitation does not mean we should always call setpgid() before it;
no, there are no such simple rules.

The same applies, and even more so, to the inheritance of namespaces:
some of a task's children may inherit the task's old namespaces, some
may inherit the current ones, and some will inherit future ones. A
child will be able to assign a namespace, say a net ns, only if its
userns allows it to. This imposes yet another limitation on the
process creation order, while still not giving a stable rule for
choosing the order in which to create the process tree. No, the only
rule is: "at each moment in time, one of the tasks may perform one of
the above actions".

This is a mathematical problem of ordering a finite number of actors
with a finite number of rules and relationships between them, where
the rules do not allow an unambiguous replay from the ground up given
only the end state. We behave well in the limited and likely cases of
process tree configurations, and this is enough for most cases. But
the general problem is very difficult, and currently it is not even
proven that it has a solution as a finite set of rules you may apply
to restore any process tree.

Kirill