On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
> > > So the problem is if we do it in
> > > userland with the current functionality you'll run out of VMAs and
> > > slowdown performance too much.
> > >
> > > But all you need is the ability to map single pages in the address
> > > space.
> >
> > Would this also let you set different pgprots for different pages in the
> > same VMA? It would be useful for write barriers in garbage collectors
> > (such as boehm-gc). These do not have _that_ many VMAs, because every
> > GC cycles could merge all of them back to a single VMA with PROT_READ
> > permissions; however, they still put some strain on the VM subsystem.
>
> Changing permission sounds more tricky as more code may make
> assumptions on the vma before checking the pte.
>
> Adding a magic unmapped pte entry sounds fairly safe because there's
> the migration pte already used by migrate which halts page faults and
> wait, that creates a precedent. So I guess we could reuse the same
> code that already exists for the migration entry and we'd need to fire
> a signal and returns to userland instead of waiting. The signal should
> be invoked before the page fault will trigger again.

Delivering signals is slow, and you can't use signalfd for it, because
that can be routed to a different task. I would like an fd based
protocol with an explicit ack so the other end can be implemented by the
kernel, to use with RDMA. Kind of like how vhost-net talks to a guest
via a kvm ioeventfd/irqfd.

> Of course if the
> signal returns and does nothing it'll loop at 100% cpu load but that's
> ok. Maybe it's possible to tweak the permissions but it will need a
> lot more thoughts. Specifically for anon pages marking them readonly
> sounds possible if they are supposed to behave like regular COWs (not
> segfaulting or anything), as you already can have a mixture of
> readonly and read-write ptes (not to tell readonly KSM pages), but for
> any other case it's non trivial.
> Last but not the least the API here
> would be like a vma-less-mremap, moving a page from one address to
> another without modifying the vmas, the permission tweak sounds more
> like an mprotect, so I'm unsure if it could do both or if it should be
> an optimization to consider independently.

Doesn't this stuff require tlb flushes across all threads?

>
> In theory I suspect we could also teach mremap to do a
> not-vma-mangling mremap if we move pages that aren't shared and so we
> can adjust the page->index of the pages, instead of creating new vmas
> at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> syscall that only works on top of fault-unmapped areas is simpler and
> safer. mremap semantics requires nuking the dst region before the move
> starts. If we would teach mremap how to handle the fault-unmapped
> areas we could just add one syscall prepare_fault_area (or whatever
> name you choose).
>
> The locking of doing a vma-less-mremap still sounds tricky but I doubt
> you can avoid that locking complexity by using the chardevice as long
> as the chardevice backed-memory still allows THP, migration and swap,
> if you want to do it atomic-zerocopy and I think zerocopy would be
> better especially if the network card is fast and all vcpus are
> faulting into unmapped pages simultaneously so triggering heavy amount
> of copying from all physical cpus.
>
> I don't mean the current device driver doing a copy_user won't work or
> is bad idea, it's more self contained and maybe easier to merge
> upstream. I'm just presenting another option more VM integrated
> zerocopy with just 2 syscalls.

Zerocopy is really interesting here, esp. w/ RDMA. But while adding
ptes is cheap, removing them is not. I wonder if we can make a
write-only page? Of course it's unmapped for cpu access, but we can
allow DMA write access from the NIC. Probably too weird.
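
To make the fd-based protocol with an explicit ack (from my reply to the
signal proposal above) a bit more concrete, here is a rough
userspace-only sketch using eventfd(2). All names in it (fault_chan,
resolve_one, fault_and_wait, demo_round_trip) are made up for
illustration, and passing the page index as the eventfd counter value is
purely an assumption to keep the example small; none of this is a
proposed kernel interface.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical channel: one eventfd per direction. */
struct fault_chan {
	int fault_fd;	/* faulting side -> resolver */
	int ack_fd;	/* resolver -> faulting side */
};

int fault_chan_init(struct fault_chan *c)
{
	c->fault_fd = eventfd(0, 0);
	if (c->fault_fd < 0)
		return -1;
	c->ack_fd = eventfd(0, 0);
	if (c->ack_fd < 0) {
		close(c->fault_fd);
		return -1;
	}
	return 0;
}

/* Resolver: wait for one fault, "fetch" the page, then ack explicitly. */
int resolve_one(struct fault_chan *c)
{
	uint64_t v;

	if (read(c->fault_fd, &v, sizeof(v)) != sizeof(v))
		return -1;
	/* a real resolver would pull the page over the network here */
	if (write(c->ack_fd, &v, sizeof(v)) != sizeof(v))
		return -1;
	return 0;
}

/*
 * Faulting side: report the page and sleep until it has been acked.
 * Returns the acked page index, or -1 on error.
 */
long fault_and_wait(struct fault_chan *c, uint64_t page)
{
	uint64_t v = page + 1;	/* a zero counter would never wake the reader */

	if (write(c->fault_fd, &v, sizeof(v)) != sizeof(v))
		return -1;
	if (read(c->ack_fd, &v, sizeof(v)) != sizeof(v))
		return -1;
	return (long)(v - 1);
}

/* Run the resolver in a child process and do one fault round trip. */
long demo_round_trip(uint64_t page)
{
	struct fault_chan c;
	long ret;
	pid_t pid;

	if (fault_chan_init(&c) < 0)
		return -1;
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(resolve_one(&c) ? 1 : 0);
	ret = fault_and_wait(&c, page);
	waitpid(pid, NULL, 0);
	close(c.fault_fd);
	close(c.ack_fd);
	return ret;
}
```

The point of the explicit ack is that the faulting side blocks on a
plain fd read instead of taking a signal, so the resolver end could in
principle be a userspace thread or something in the kernel without the
faulting side knowing the difference.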

>
> vmas must not be involved in the mremap for reliability, or too much
> memory could get pinned in vmas even if we temporary lift the
> /proc/sys/vm/max_map_count for the process. Plus sending another
> signal (not sigsegv or sigbus) should be more reliable in case the
> migration crashes for real.

--
error compiling committee.c: too many arguments to function